Abstract. We describe some of the machinery behind recent progress in establishing
infinitely many arithmetic progressions of length k in various sets of integers, in
particular in arbitrary dense subsets of the integers, and in the primes.



1. Introduction 
A celebrated theorem of Roth [36] in 1953 asserts:



Theorem 1.1 (Roth's theorem, first version). [36] Let $A \subset \mathbb{Z}^+$ be a subset of
integers with positive upper density, thus $\limsup_{N \to \infty} \frac{1}{N} |A \cap [1,N]| > 0$. Then $A$ contains
infinitely many arithmetic progressions $n, n+r, n+2r$ of length three.

Here we of course restrict the spacing $r$ of the progression to be non-zero. This
theorem was originally proven by Roth by Fourier analytic methods and a stopping time
argument, and we shall reprove it below (in fact, we shall give two proofs). This theorem
was then generalized substantially by Szemerédi in 1975:

Theorem 1.2 (Szemerédi's theorem, first version). [38], [39] Let $A \subset \mathbb{Z}^+$ be a subset of
integers with positive upper density, thus $\limsup_{N \to \infty} \frac{1}{N} |A \cap [1,N]| > 0$, and let $k \geq 3$.
Then $A$ contains infinitely many arithmetic progressions $n, n+r, \ldots, n+(k-1)r$ of
length $k$.

Thus Roth's theorem is the k = 3 version of Szemeredi's theorem. (The cases k < 3 
are trivial). 

Szemerédi's original proof was combinatorial (relying in particular on graph theory)
and very complicated. A substantially shorter proof - but one involving the full machinery
of measure theory and ergodic theory, as well as the axiom of choice - was obtained
by Furstenberg [10], [11] in 1977. Since then, there have been two other types of proofs:
a proof of Gowers in 2001 which combines "higher order" Fourier analytic
methods with techniques from additive combinatorics; and also arguments of Gowers
[18] and Rödl-Skokan using the machinery of hypergraphs. While we will not
discuss all these separate proofs in detail here, we will need to discuss certain ideas from
each of these arguments, as they will eventually be used in the proof of Theorem 1.5
below.

The above theorems do not apply directly to the set of prime numbers, as they 
have density zero. Nevertheless, in 1939 van der Corput [13] proved, by using Fourier 
analytic methods (the Hardy-Littlewood circle method) which were somewhat similar 
to the methods used by Roth, the following result:

Theorem 1.3 (Van der Corput's theorem). [32] Let $P \subset \mathbb{Z}^+$ be the set of primes. Then
$P$ contains infinitely many arithmetic progressions $n, n+r, n+2r$ of length three.

However, just as Roth's Fourier-analytic methods proved very difficult to extend 
beyond the k = 3 case, so too did van der Corput's arguments. The proof relied on very 
delicate information concerning the Fourier coefficients of the primes (or more precisely 
of the von Mangoldt function $\Lambda(n)$, which is essentially supported on the primes).
This additional information allows one to not only show that there are infinitely many 
progressions of primes of length three, but also to obtain an asymptotic count as to how 
many such progressions there are; we shall return to this point later. 

Roth's theorem and van der Corput's theorem were combined by Green [20] in 2003
to obtain

Theorem 1.4 (Green's theorem). [20] Let $A \subset P$ be a subset of primes with positive
relative upper density:

\[ \limsup_{N \to \infty} \frac{|A \cap [1,N]|}{|P \cap [1,N]|} > 0. \]

Then $A$ contains infinitely many arithmetic progressions $n, n+r, n+2r$ of length three.

A key observation made in that paper was that one did not need very deep number- 
theoretic information about the structure of A or P to prove this result. In fact, the 
same result holds not just for relatively dense subsets of primes, but relatively dense 
subsets of almost primes (numbers containing no small prime factors); we shall return 
to this point later. 

In 2004, Ben Green and the author [21] were able to extend this theorem to arbitrarily
long progressions, by replacing Fourier-analytic ideas with ergodic theory ones:

Theorem 1.5. [21] Let $A \subset P$ be a subset of primes with positive relative upper density:

\[ \limsup_{N \to \infty} \frac{|A \cap [1,N]|}{|P \cap [1,N]|} > 0, \]

and let $k \geq 3$. Then $A$ contains infinitely many arithmetic progressions $n, n+r, \ldots, n+(k-1)r$
of length $k$. In particular, the primes contain arbitrarily long arithmetic
progressions.

At the time of writing, we are not able to obtain van der Corput's more precise 
asymptotic estimate on the number of prime progressions of arbitrary length k, but we 
are able to do so in the k = 4 case; see Section El 

In this expository article, we review briefly the methods of proof of Roth's theorem 
and Szemeredi's theorem for various values of k, focusing in particular on the cases k = 3 
and k = 4 which are amenable to Fourier analysis and "quadratic Fourier analysis" 
respectively. Then we discuss the recent extension of these theorems to the prime 
numbers. There is substantial overlap between this survey and [22].

2. Progressions of length three 

We now discuss some proofs of Roth's theorem. We first observe that this theorem 
can be reformulated in one of two equivalent "finitary" settings: firstly as a statement 
about subsets of long arithmetic progressions, and secondly as a statement about a large 
cyclic group.

We need some notation. The interval $[a, b]$ shall always refer to the discrete interval
$\{n \in \mathbb{Z} : a \leq n \leq b\}$. We use $|A|$ to denote the cardinality of a finite set $A$. If $A$
is a finite set and $f : A \to \mathbb{C}$ is a complex-valued function, we define the expectation
$\mathbb{E}(f) = \mathbb{E}(f(n) | n \in A)$ of $f$ to be the quantity

\[ \mathbb{E}(f(n) | n \in A) := \frac{1}{|A|} \sum_{n \in A} f(n); \]

similarly, if $P(n)$ is a property pertaining to elements of $A$, we define the probability of
$P$ to be

\[ \mathbb{P}(P) = \mathbb{P}(P(n) | n \in A) := \frac{1}{|A|} |\{n \in A : P(n) \text{ is true}\}|, \]

and we define $1_P$ to be the indicator function of $P$, thus $1_P(n) = 1$ when $P(n)$ is true
and $1_P(n) = 0$ otherwise.
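
As a concrete reading of this notation, here is a minimal Python sketch of the expectation and probability over a finite set. The helper names `expectation` and `probability` are illustrative choices, not notation from the text; exact rational arithmetic is used so the averages can be compared without rounding.

```python
from fractions import Fraction

def expectation(f, A):
    """E(f(n) | n in A): the average of f over the finite set A."""
    A = list(A)
    return sum(Fraction(f(n)) for n in A) / len(A)

def probability(P, A):
    """P(P(n) | n in A): the proportion of n in A for which P(n) holds.

    This is the expectation of the indicator function 1_P.
    """
    return expectation(lambda n: 1 if P(n) else 0, A)

A = range(1, 11)                                  # the discrete interval [1, 10]
mean_square = expectation(lambda n: n * n, A)     # (1 + 4 + ... + 100) / 10
even_density = probability(lambda n: n % 2 == 0, A)
```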

Theorem 2.1 (Roth's theorem, second version). Let $0 < \delta \leq 1$. Then there exists an
$N_0 := N_0(\delta) > 1$ such that, for any arithmetic progression $P \subset \mathbb{Z}$ of length at least
$N_0$ and any subset $A \subset P$ of density $\mathbb{P}(n \in A | n \in P) \geq \delta$, the set $A$ contains at least one
arithmetic progression $n, n+r, n+2r$ of length three.

Note that the choice of progression $P$ is unimportant to this theorem; only the length
is relevant. This is because all progressions of a fixed length are isomorphic to each
other by an affine scaling map. Thus one could set $P = [1, N]$ here for some $N \geq N_0$
with no loss of generality.

Henceforth let us call a function $f : A \to \mathbb{C}$ on a finite set $A$ bounded if $|f(n)| \leq 1$ for
all $n \in A$.

Theorem 2.2 (Roth's theorem, third version). Let $0 < \delta \leq 1$, and let $N \geq 1$ be a
prime integer. Let $f : \mathbb{Z}/N\mathbb{Z} \to [0, 1]$ be a non-negative bounded function with large
mean

\[ \mathbb{E}(f(n) | n \in \mathbb{Z}/N\mathbb{Z}) \geq \delta. \tag{2.1} \]

Then we have

\[ \mathbb{E}(f(n) f(n+r) f(n+2r) | n, r \in \mathbb{Z}/N\mathbb{Z}) \geq c(3, \delta) - o_\delta(1) \tag{2.2} \]

for some $c(3, \delta) > 0$ depending only on $\delta$, where $o_\delta(1)$ is a quantity that depends on $\delta$
and $N$, and for each fixed $\delta$ tends to zero as $N$ goes to infinity.

Before we prove any of these versions, let us first sketch why they are equivalent.

Proof. [Second version implies first version] Let $A$ be a set of positive upper density.
Then there exists a $\delta > 0$ such that $|A \cap [1, N]| \geq 2\delta N$ for infinitely many $N$. Using
this, one can find infinitely many disjoint intervals $[a_j, b_j]$ of length $b_j - a_j \geq N_0(\delta)$ such
that $A$ has density at least $\delta$ on these intervals (for instance, intervals of the form
$[\delta N, N]$ for a sufficiently sparse sequence of such $N$):

\[ \mathbb{P}(n \in A | n \in [a_j, b_j]) \geq \delta. \]

Applying the second version of Roth's theorem to each such interval, we thus see that $A$ has
infinitely many progressions of length 3, as desired. □

Proof. [First version implies second version] Suppose for contradiction that the second
version failed. Then we could find a $\delta > 0$ and sets $A_j \subset [1, N_j]$ (with $N_j \to \infty$) with
$\mathbb{P}(n \in A_j | n \in [1, N_j]) \geq \delta$ and with each $A_j$ containing no arithmetic progressions of
length 3. By refining the sequence if necessary we may assume that the $N_j$ are increasing
in $j$ (indeed we could make this sequence grow incredibly fast if desired). If one then
considers the set $A := \bigcup_{j=1}^{\infty} (2N_j + A_j)$, then it is easy to show that $A$ has positive upper
density but contains no arithmetic progressions of length three, a contradiction. □

Proof. [Third version implies second version] Let $p$ be a prime between $2N$ and $4N$
(which always exists by Bertrand's postulate). Let $\pi : [1, N] \to \mathbb{Z}/p\mathbb{Z}$ be the canonical
injection of $[1, N]$ into $\mathbb{Z}/p\mathbb{Z}$. If $A \subset [1, N]$ has density $\mathbb{P}(n \in A | n \in [1, N]) \geq \delta$, then
the function $f := 1_{\pi(A)}$ on $\mathbb{Z}/p\mathbb{Z}$ is non-negative, bounded, and obeys the estimate

\[ \mathbb{E}(f(n) | n \in \mathbb{Z}/p\mathbb{Z}) = \frac{|A|}{p} \geq \frac{|A|}{4N} \geq \delta/4. \]

Thus by the third version of Roth's theorem we have

\[ \mathbb{E}(f(n) f(n+r) f(n+2r) | n, r \in \mathbb{Z}/p\mathbb{Z}) \geq c(3, \delta/4) - o_\delta(1). \]

Note that $f(n) f(n+r) f(n+2r)$ is non-zero only when $n = \pi(n')$, $n + r = \pi(n' + r')$,
$n + 2r = \pi(n' + 2r')$ and $n' \in [1, N]$, $-N < r' < N$, in which case this quantity is equal
to 1. Thus we have

\[ |\{(n', r') : n', n'+r', n'+2r' \in A; \ n' \in [1, N]; \ -N < r' < N\}| \geq c(3, \delta/4) p^2 - o_\delta(p^2). \]

We can discard the $r' = 0$ terms as they contribute $O(N) = o(p^2)$. By symmetry we
can then reduce to the positive $r'$. We thus have

\[ |\{(n', r') : n', n'+r', n'+2r' \in A; \ n' \in [1, N]; \ 0 < r' < N\}| \geq c(3, \delta/4) p^2 / 2 - o_\delta(p^2). \]

If $N$ (and hence $p$) is sufficiently large, then the right-hand side is non-zero, and we have
demonstrated the existence of a non-trivial arithmetic progression of length three in $A$.
(In fact we have demonstrated $\geq c'(3, \delta) N^2$ such progressions for some $c'(3, \delta) > 0$.) □
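
The embedding step relies on the fact that, once $p > 2N$, a progression in $\mathbb{Z}/p\mathbb{Z}$ whose three terms all lie in the image of $[1, N]$ cannot wrap around. The following Python sketch checks this by brute force on a small example; the set `A`, the values of `N` and `p`, and the function names are illustrative assumptions rather than anything taken from the text.

```python
def three_aps_mod_p(A, p):
    """All (n, r) with r != 0 mod p such that n, n+r, n+2r lie in A modulo p."""
    S = {a % p for a in A}
    return [(n, r) for n in S for r in range(1, p)
            if (n + r) % p in S and (n + 2 * r) % p in S]

def is_genuine(n, r, p, A):
    """Lift r to its representative in (-p/2, p/2) and test the progression in Z."""
    r_lift = r if r < p - r else r - p
    return {n, n + r_lift, n + 2 * r_lift} <= set(A)

N, p = 20, 41                              # p is a prime with 2N < p < 4N
A = [1, 2, 4, 5, 10, 11, 13, 14, 19]       # illustrative subset of [1, N]
progressions = three_aps_mod_p(A, p)
all_genuine = all(is_genuine(n, r, p, A) for n, r in progressions)
```

Each pair `(n, r)` found modulo `p` lifts to an honest progression of integers inside `A`, which is what allows the count in $\mathbb{Z}/p\mathbb{Z}$ to be transferred back to $[1, N]$.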

Proof. [Second version implies third version] This argument is due to Varnavides [45].
We first observe that to prove the theorem, it suffices to do so when $f$ is a characteristic
function $f = 1_A$. This is because if $f$ is non-negative, bounded and obeys (2.1) then the
set $A := \{n \in \mathbb{Z}/N\mathbb{Z} : f(n) \geq \delta/2\}$ must have density at least $\mathbb{P}(n \in A | n \in \mathbb{Z}/N\mathbb{Z}) \geq \delta/2$.
Since we have the pointwise bound$^{1}$ $f \geq \frac{\delta}{2} 1_A$ from below, we have

\[ \mathbb{E}(f(n) f(n+r) f(n+2r) | n, r \in \mathbb{Z}/N\mathbb{Z}) \geq \frac{\delta^3}{8} \mathbb{E}(1_A(n) 1_A(n+r) 1_A(n+2r) | n, r \in \mathbb{Z}/N\mathbb{Z}) \]

and so (2.2) for $f$ would follow from (2.2) for $1_A$ (with a slightly worse value of $c(3, \delta)$,
namely $\frac{\delta^3}{8} c(3, \delta/2)$).

It remains to verify (2.2) for characteristic functions. Let $M = M(\delta)$ be a large
integer depending on $\delta$ to be chosen later. To prove (2.2) it suffices to do so in the case
$N \gg M$, since the case $N = O(M)$ is vacuous.

The idea is to cover $\mathbb{Z}/N\mathbb{Z}$ uniformly by progressions $P_{a,b} := \{a + b, a + 2b, \ldots, a + Mb\}$
of length $M$, where we allow $b$ to be zero. Indeed we observe that for every $n \in \mathbb{Z}/N\mathbb{Z}$
there are exactly $NM$ pairs $(a, b) \in \mathbb{Z}/N\mathbb{Z} \times \mathbb{Z}/N\mathbb{Z}$ such that $n \in a + [1, M] \cdot b$ (this is
easiest to see by choosing $b$ first). Thus

\[ \delta \leq \mathbb{P}(n \in A | n \in \mathbb{Z}/N\mathbb{Z}) \]
\[ = \mathbb{P}(n \in A | n \in P_{a,b}; \ (a, b) \in \mathbb{Z}/N\mathbb{Z} \times \mathbb{Z}/N\mathbb{Z}) \]
\[ = \mathbb{E}(\mathbb{P}(n \in A | n \in P_{a,b}) | (a, b) \in \mathbb{Z}/N\mathbb{Z} \times \mathbb{Z}/N\mathbb{Z}). \]
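
The uniform covering claim can be checked by brute force for small parameters. The sketch below assumes the count is taken with multiplicity, that is, it counts triples $(a, b, j)$ with $1 \leq j \leq M$ and $a + jb = n$, so that the degenerate progressions with $b = 0$ contribute $M$ times; the function name `covering_count` is illustrative.

```python
def covering_count(n, N, M):
    """Number of triples (a, b, j) with 1 <= j <= M and a + j*b = n in Z/NZ."""
    return sum(1 for a in range(N) for b in range(N) for j in range(1, M + 1)
               if (a + j * b) % N == n)

N, M = 11, 4
counts = [covering_count(n, N, M) for n in range(N)]
```

Every entry of `counts` equals `N * M`: once `b` and `j` are chosen, `a` is determined, which is the "choose $b$ first" argument in the text.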



1 This is somewhat crude. A slightly better argument would be to select $A$ randomly, with each
element $n \in \mathbb{Z}/N\mathbb{Z}$ having a probability of $f(n)$ to lie in $A$, and then take averages, but in practice
this does not yield significantly better constants at the end.

In particular, if we let $\Omega \subset \mathbb{Z}/N\mathbb{Z} \times \mathbb{Z}/N\mathbb{Z}$ be the set of pairs $(a, b)$ such that
$\mathbb{P}(n \in A | n \in P_{a,b}) \geq \delta/2$, then we have

\[ \mathbb{P}((a, b) \in \Omega | (a, b) \in \mathbb{Z}/N\mathbb{Z} \times \mathbb{Z}/N\mathbb{Z}) \geq \delta/2. \tag{2.3} \]

Now choose $M := N_0(\delta/2)$. From the definition of $\Omega$ and the second form of Roth's
theorem, we see that for every $(a, b) \in \Omega$, the set $A \cap P_{a,b}$ contains at least one non-trivial
arithmetic progression $n, n+r, n+2r$ of length three. In particular we have

\[ \mathbb{P}(n, n+r, n+2r \in A | n, n+r, n+2r \in P_{a,b}; \ r \neq 0) \geq M^{-2} \]

since the number of progressions $n, n+r, n+2r$ in $P_{a,b}$ is at most $M^2$.

Now observe that every progression $n, n+r, n+2r \in \mathbb{Z}/N\mathbb{Z}$ with $r \neq 0$ is contained in
exactly the same number of progressions $P_{a,b}$, since they are all isomorphic using affine
scaling maps (here we use that $N = |\mathbb{Z}/N\mathbb{Z}|$ is prime). Thus we have

\[ \mathbb{P}(n, n+r, n+2r \in A | n, n+r, n+2r \in \mathbb{Z}/N\mathbb{Z}; \ r \neq 0) \geq M^{-2}. \]

In particular (adding in the $r = 0$ case) we have

\[ \mathbb{E}\Big( \prod_{j=0}^{2} 1_A(x + jr) \Big| x \in \mathbb{Z}/N\mathbb{Z}; \ r \in \mathbb{Z}/N\mathbb{Z} \Big) \geq M^{-2} - o(1) \]

which gives (2.2) as desired (with $c(3, \delta) = M^{-2} = N_0(\delta/2)^{-2}$ for characteristic functions,
and hence $c(3, \delta) = \frac{\delta^3}{8} N_0(\delta/4)^{-2}$ for arbitrary functions). □
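
In light of these equivalent formulations, it is natural to introduce the Lebesgue spaces
$L^p(\mathbb{Z}/N\mathbb{Z})$ for $1 \leq p \leq \infty$, defined as the complex-valued functions on $\mathbb{Z}/N\mathbb{Z}$ equipped
with the norm

\[ \|f\|_{L^p(\mathbb{Z}/N\mathbb{Z})} := \mathbb{E}(|f|^p)^{1/p} = \Big( \frac{1}{N} \sum_{n \in \mathbb{Z}/N\mathbb{Z}} |f(n)|^p \Big)^{1/p} \]

and to introduce the trilinear form $\Lambda_3 : L^{p_1}(\mathbb{Z}/N\mathbb{Z}) \times L^{p_2}(\mathbb{Z}/N\mathbb{Z}) \times L^{p_3}(\mathbb{Z}/N\mathbb{Z}) \to \mathbb{C}$
by

\[ \Lambda_3(f, g, h) := \mathbb{E}(f(n) g(n+r) h(n+2r) | n, r \in \mathbb{Z}/N\mathbb{Z}). \tag{2.4} \]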
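
For small $N$ the trilinear form can be evaluated directly from its definition. The following Python sketch does so with exact rational arithmetic; functions on $\mathbb{Z}/N\mathbb{Z}$ are represented as lists of length `N`, and the name `lambda3` and the example set are illustrative assumptions.

```python
from fractions import Fraction

def lambda3(f, g, h, N):
    """Lambda_3(f, g, h) = E(f(n) g(n+r) h(n+2r) | n, r in Z/NZ).

    f, g, h are lists of length N holding the values on Z/NZ.
    """
    total = sum(f[n] * g[(n + r) % N] * h[(n + 2 * r) % N]
                for n in range(N) for r in range(N))
    return Fraction(total, N * N)

N = 7
A = {0, 1, 3}                                   # illustrative subset of Z/7Z
ind = [1 if n in A else 0 for n in range(N)]
value = lambda3(ind, ind, ind, N)               # only the r = 0 terms survive here

delta = Fraction(1, 3)
constant = [delta] * N
main_term = lambda3(constant, constant, constant, N)
```

For this particular `A` there are no non-trivial progressions modulo 7, so only the three `r = 0` terms contribute, while a constant function $\delta$ gives exactly $\delta^3$.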

Here we always assume $N$ to be a large prime (in particular, it is odd). Thus the
third version of Roth's theorem can be reformulated as follows: if $f \in L^\infty(\mathbb{Z}/N\mathbb{Z})$ is a
non-negative function obeying the bounds

\[ 0 < \delta \leq \|f\|_{L^1(\mathbb{Z}/N\mathbb{Z})} \leq \|f\|_{L^\infty(\mathbb{Z}/N\mathbb{Z})} \leq 1 \]

then

\[ \Lambda_3(f, f, f) \geq c(3, \delta) - o_\delta(1) \tag{2.5} \]

for some $c(3, \delta) > 0$. Note that the task here is to obtain lower bounds on the form
$\Lambda_3(f, f, f)$ rather than upper bounds, which are considerably easier to obtain. For
instance, from multilinear interpolation (or Young's inequality) it is easy to establish
the upper bounds

\[ |\Lambda_3(f, g, h)| \leq \|f\|_{L^p} \|g\|_{L^q} \|h\|_{L^r} \tag{2.6} \]

whenever $1 \leq p, q, r \leq \infty$ and $\frac{1}{p} + \frac{1}{q} + \frac{1}{r} \leq 2$; here $f, g, h$ are arbitrary complex-valued
functions. Note that the non-negativity of $f$ and of $\Lambda_3$ (i.e. $\Lambda_3(f, g, h)$ is non-negative
whenever $f, g, h$ are non-negative) is crucial, since without this one could not even obtain
the trivial bound$^{2}$ $\Lambda_3(f, f, f) \geq 0$, let alone (2.5).

At first glance it does not appear that upper bounds such as (2.6) are useful for
proving lower bounds of the type (2.5). However, one can use the multilinearity of
$\Lambda_3$ to convert upper bounds to lower bounds as follows. Without loss of generality
we may take $\mathbb{E}(f)$ to be equal to $\delta$ (since if $\mathbb{E}(f) > \delta$ we may simply decrease $f$ and
hence $\Lambda_3(f, f, f)$). We decompose$^{3}$ $f$ into a "good function" $g := \mathbb{E}(f) = \delta$ and a "bad
function" $b := f - \mathbb{E}(f)$, and then we can split $\Lambda_3(f, f, f)$ into eight components:

\[ \Lambda_3(f, f, f) = \Lambda_3(g, g, g) + \ldots + \Lambda_3(b, b, b). \]

The first term can be computed explicitly, and can be viewed as a main term:

\[ \Lambda_3(g, g, g) = \Lambda_3(\delta, \delta, \delta) = \delta^3. \]

Thus if one can obtain upper bounds on the magnitude of the remaining seven terms
which add up to less than $\delta^3$, then one can hope to prove (2.5). The bound (2.6) turns
out to be too weak to do this, unless $\delta$ is very close to 1 (e.g. if $\delta > 2/3$); however, one
can do better by replacing the Lebesgue norms with some additional norms, based on
the Fourier transform

\[ \hat f(\xi) := \mathbb{E}(f(x) e_N(-x\xi) | x \in \mathbb{Z}/N\mathbb{Z}), \]

where $e_N : \mathbb{Z}/N\mathbb{Z} \to S^1$ is the character $e_N(x) := \exp(2\pi i x / N)$. From the Fourier
inversion formula

\[ f(x) = \sum_{\xi \in \mathbb{Z}/N\mathbb{Z}} \hat f(\xi) e_N(x\xi) \]

we see that

\[ \Lambda_3(f, g, h) = \sum_{\xi_1, \xi_2, \xi_3 \in \mathbb{Z}/N\mathbb{Z}} \hat f(\xi_1) \hat g(\xi_2) \hat h(\xi_3) \, \mathbb{E}(e_N(n\xi_1 + (n+r)\xi_2 + (n+2r)\xi_3) | n, r \in \mathbb{Z}/N\mathbb{Z}). \]

The expectation on the right-hand side equals 1 when $\xi_1 = \xi_3$ and $\xi_2 = -2\xi_1$, and is equal
to zero otherwise. Thus we have the identity

\[ \Lambda_3(f, g, h) = \sum_{\xi \in \mathbb{Z}/N\mathbb{Z}} \hat f(\xi) \hat g(-2\xi) \hat h(\xi). \]
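
The Fourier identity for $\Lambda_3$ can be verified numerically. The sketch below uses the normalisation of the text (expectation in the Fourier transform, plain sum in the inversion formula) and compares the direct and Fourier-side evaluations on arbitrary test data; the sample functions and the helper names are illustrative assumptions.

```python
import cmath

def e_N(x, N):
    """The character e_N(x) = exp(2 pi i x / N)."""
    return cmath.exp(2j * cmath.pi * x / N)

def fourier(f, N):
    """hat f(xi) = E(f(x) e_N(-x xi) | x in Z/NZ), returned as a list over xi."""
    return [sum(f[x] * e_N(-x * xi, N) for x in range(N)) / N for xi in range(N)]

def lambda3_direct(f, g, h, N):
    """Lambda_3 evaluated from its definition as an average over n and r."""
    return sum(f[n] * g[(n + r) % N] * h[(n + 2 * r) % N]
               for n in range(N) for r in range(N)) / N ** 2

def lambda3_fourier(f, g, h, N):
    """Lambda_3 evaluated as sum over xi of hat f(xi) hat g(-2 xi) hat h(xi)."""
    F, G, H = fourier(f, N), fourier(g, N), fourier(h, N)
    return sum(F[xi] * G[(-2 * xi) % N] * H[xi] for xi in range(N))

N = 11
f = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1]
g = [0.5, 1, 0, 0.25, 1, 0, 0, 1, 0.75, 0, 0]
h = [1, 1, 0, 0, 0, 1, 0.5, 0, 0, 1, 0]
gap = abs(lambda3_direct(f, g, h, N) - lambda3_fourier(f, g, h, N))

# Plancherel with the same normalisation: sum |hat f|^2 equals E|f|^2.
plancherel_gap = abs(sum(abs(c) ** 2 for c in fourier(f, N))
                     - sum(abs(v) ** 2 for v in f) / N)
```

Both gaps are zero up to floating-point rounding.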

From the Plancherel identity

\[ \|f\|_{L^2(\mathbb{Z}/N\mathbb{Z})} = \|\hat f\|_{l^2(\mathbb{Z}/N\mathbb{Z})} \]

and Hölder's inequality, we thus have the estimate

\[ |\Lambda_3(f, g, h)| \leq \|f\|_{L^2(\mathbb{Z}/N\mathbb{Z})} \|g\|_{L^2(\mathbb{Z}/N\mathbb{Z})} \|\hat h\|_{l^\infty(\mathbb{Z}/N\mathbb{Z})} \tag{2.7} \]

and similarly for permutations. We also have the variant

\[ |\Lambda_3(f, g, h)| \leq \|f\|_{L^2(\mathbb{Z}/N\mathbb{Z})} \|\hat g\|_{l^4(\mathbb{Z}/N\mathbb{Z})} \|\hat h\|_{l^4(\mathbb{Z}/N\mathbb{Z})}. \tag{2.8} \]

This leads to the following criterion to ensure $\Lambda_3(f, f, f)$ is positive.

2 There is also the slightly better trivial bound $\Lambda_3(f, f, f) \geq \|f\|^3_{L^3(\mathbb{Z}/N\mathbb{Z})} / N$ coming from the $r = 0$
term in (2.4), but this lower bound is $o(1)$ and is thus not significantly better than the trivial bound
of 0.

3 This is of course a very simple decomposition. Later on we shall use more sophisticated
decompositions, which can be viewed as "arithmetic" versions of the Calderón-Zygmund decomposition in
harmonic analysis.
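
Proposition 2.3. Let $f \in L^\infty(\mathbb{Z}/N\mathbb{Z})$ have a decomposition of the form $f = g + b$,
where

\[ \|g\|_{L^\infty(\mathbb{Z}/N\mathbb{Z})}, \|b\|_{L^\infty(\mathbb{Z}/N\mathbb{Z})} = O(1); \quad \|g\|_{L^1(\mathbb{Z}/N\mathbb{Z})}, \|b\|_{L^1(\mathbb{Z}/N\mathbb{Z})} = O(\delta). \tag{2.9} \]

Then we have the estimates

\[ \Lambda_3(f, f, f) = \Lambda_3(g, g, g) + O(\delta \|\hat b\|_{l^\infty(\mathbb{Z}/N\mathbb{Z})}) \tag{2.10} \]

and

\[ \Lambda_3(f, f, f) = \Lambda_3(g, g, g) + O(\delta^{5/4} \|\hat b\|_{l^4(\mathbb{Z}/N\mathbb{Z})}). \]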

Remark 2.4. Interestingly, estimates of this type (after being suitably localized in phase 
space) have proven to be crucial in recent progress in understanding the bilinear Hilbert 
transform (see e.g. [SH^ or at least in understanding the contribution of individual 
"trees" to that transform. Indeed there is some formal similarity between the trilinear 
form $\Lambda_3$ and the trilinear form $\Lambda(f, g, h) := \mathrm{p.v.} \int\int f(x+t) g(x-t) h(x) \frac{dt\,dx}{t}$ associated
to the bilinear Hilbert transform.

Proof. From the hypotheses we have

\[ \|g\|_{L^2(\mathbb{Z}/N\mathbb{Z})}, \|b\|_{L^2(\mathbb{Z}/N\mathbb{Z})} = O(\delta^{1/2}) \]

and hence by Plancherel

\[ \|\hat g\|_{l^2(\mathbb{Z}/N\mathbb{Z})}, \|\hat b\|_{l^2(\mathbb{Z}/N\mathbb{Z})} = O(\delta^{1/2}). \]

On the other hand, from the $L^1$ bounds on $g$ and $b$ we have

\[ \|\hat g\|_{l^\infty(\mathbb{Z}/N\mathbb{Z})}, \|\hat b\|_{l^\infty(\mathbb{Z}/N\mathbb{Z})} = O(\delta) \]

and so by Hölder's inequality

\[ \|\hat g\|_{l^4(\mathbb{Z}/N\mathbb{Z})}, \|\hat b\|_{l^4(\mathbb{Z}/N\mathbb{Z})} = O(\delta^{3/4}). \]

The claims now follow by decomposing $\Lambda_3(f, f, f)$ into eight pieces as before, setting
aside $\Lambda_3(g, g, g)$ as a main term, and using (2.7), (2.8) (and permutations thereof) to
estimate all the remaining pieces (which involve at least one copy of $b$). □

This suggests the following strategy: in order to obtain a non-trivial lower bound on
$\Lambda_3(f, f, f)$, we should obtain a splitting $f = g + b$ obeying the bounds (2.9) where the
"good" function $g$ already has a large value of $\Lambda_3(g, g, g)$ (thus we shall presumably
want $g$ to be non-negative), and the "bad" function $b$ has a small Fourier transform,
either in $l^\infty$ norm or $l^4$ norm. Note that up to polynomial factors of $\delta$, the two norms
are somewhat equivalent, as one can easily establish the estimates

\[ \|\hat b\|_{l^\infty} \leq \|\hat b\|_{l^4} \leq \|\hat b\|_{l^2}^{1/2} \|\hat b\|_{l^\infty}^{1/2} = O(\delta^{1/4} \|\hat b\|_{l^\infty}^{1/2}). \tag{2.11} \]

In the original arguments involving Roth's theorem, the $l^\infty$ norm on the Fourier coefficients
was used, but as we shall see later, it is the $l^4$ norm which is easier to generalize
to "higher order" Fourier analysis, which will be necessary to treat the $k \geq 4$ case. Let
us rather informally call a function $b$ which obeys bounds such as (2.9) linearly uniform
if the Fourier transform $\hat b$ is very small in either $l^\infty$ or $l^4$; we see from (2.11) that it is not
terribly important which norm we choose here. The reason for this terminology is that
a linearly uniform function $b$ is one which is uniformly distributed with respect to linear
phase functions $e_N(x\xi)$, in the sense that the inner product of $b$ with such functions is
small. (This rather vague statement can be made more precise using Weyl's criterion
for uniform distribution.)

We have already indicated one such candidate for a decomposition, namely the
decomposition $f = g + b$ into the expectation $g := \mathbb{E}(f)$ and the expectation-free $b := f - \mathbb{E}(f)$
components of $f$. Certainly this decomposition obeys the bounds (2.9), and the value
of $\Lambda_3(g, g, g) \geq \delta^3$ is moderately large. However, at this stage we do not have very good
bounds on $\|\hat b\|_{l^\infty(\mathbb{Z}/N\mathbb{Z})}$ or $\|\hat b\|_{l^4(\mathbb{Z}/N\mathbb{Z})}$; the best bounds we have on these quantities are
$O(\delta)$ and $O(\delta^{3/4})$ respectively, and thus the error term can dominate the main term.
(Indeed, there certainly exist functions $f$ for which $\Lambda_3(f, f, f)$ is significantly different
from $\Lambda_3(g, g, g)$; consider for instance $f = 1_{[1, \delta N]}$, in which the former quantity is
comparable to $\delta^2$ and the latter is comparable to $\delta^3$.)
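
The interval example can be made concrete with a small computation. In the sketch below the modulus `N = 101` and interval length `L = 10` are illustrative choices, and `lambda3_indicator` is a hypothetical helper that counts pairs $(n, r)$ directly.

```python
from fractions import Fraction

def lambda3_indicator(S, N):
    """Lambda_3(1_S, 1_S, 1_S) on Z/NZ, computed by counting pairs (n, r)."""
    S = set(S)
    count = sum(1 for n in S for r in range(N)
                if (n + r) % N in S and (n + 2 * r) % N in S)
    return Fraction(count, N * N)

N = 101                        # a prime
L = 10                         # interval length, so delta = L / N
delta = Fraction(L, N)
actual = lambda3_indicator(range(1, L + 1), N)    # f = indicator of [1, delta N]
main_term = delta ** 3                            # Lambda_3(g, g, g) with g = E(f)
```

Here `actual` works out to $\delta^2 / 2$, larger than the main term $\delta^3$ by a factor of $1/(2\delta)$, so for such $f$ the "error" in the decomposition genuinely dominates.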

However, we can at least eliminate one case, in which $b = f - \mathbb{E}(f)$ is sufficiently
linearly uniform (for instance if $\|\hat b\|_{l^\infty} \leq \delta^2/100$). The question is then what to do in
the remaining cases, when $b = f - \mathbb{E}(f)$ is not sufficiently linearly uniform. The strategy
is then to convert the lack of linear uniformity from a liability to an asset, by showing
that this lack of uniformity implies some additional structure which one can exploit
to improve the situation. The known proofs of Roth's theorem (or more generally
Szemerédi's theorem) differ on exactly what this additional structure could be, and how
to exploit it, but they essentially fall into one of two categories$^{4}$:

• A density increment argument seeks to use the lack of uniformity in $b$ to pass
from $\mathbb{Z}/N\mathbb{Z}$ (or $[1, N]$) to a smaller object on which the function $f$ (or the set
$A$) has a larger density. One then iterates this procedure until uniformity is
obtained; this algorithm terminates since the density is bounded.

• An energy increment argument seeks to use the lack of uniformity in $b$ to improve
the decomposition $f = g + b$, replacing the good function $g$ by a function of larger
energy ($L^2$ norm). One then iterates this procedure until uniformity is obtained;
this algorithm terminates since the energy is bounded.

Both approaches are important to the theory, as they have different strengths and 
weaknesses. We illustrate this by giving two proofs of Roth's theorem, one for each of 
the above approaches. But we shall need some additional notation first; this notation 
may seem somewhat cumbersome for this application, but will become very convenient 
when we discuss the case of larger k in later sections. 

4 Szemerédi's proof of Szemerédi's theorem is a blend of the density increment and energy
increment arguments.

Definition 2.5 ($\sigma$-algebras). Let $X$ be a finite set (such as $\mathbb{Z}/N\mathbb{Z}$ or $[1, N]$). A
$\sigma$-algebra $\mathcal{B}$ in $X$ is any collection of subsets of $X$ which contains the empty set and the
full set $X$, and is closed under complementation, unions and intersections. We define
the atoms of a $\sigma$-algebra to be the minimal non-empty elements of $\mathcal{B}$ (with respect to
set inclusion); it is clear that the atoms in $\mathcal{B}$ form a partition of $X$, and $\mathcal{B}$ consists
precisely of arbitrary unions of its atoms (including the empty union $\emptyset$); thus there
is a one-to-one correspondence between $\sigma$-algebras and partitions of $X$. A function
$f : X \to \mathbb{C}$ is said to be measurable with respect to a $\sigma$-algebra $\mathcal{B}$ if all the level sets
of $f$ lie in $\mathcal{B}$, or equivalently if $f$ is constant on each of the atoms of $\mathcal{B}$. We define
$L^2(\mathcal{B})$ to be the space of $\mathcal{B}$-measurable functions, equipped with the Hilbert space inner
product $\langle f, g \rangle_{L^2(X)} := \mathbb{E}(f \overline{g})$. We can then define the conditional expectation operator
$f \mapsto \mathbb{E}(f | \mathcal{B})$ to be the orthogonal projection of $L^2(X)$ to $L^2(\mathcal{B})$. An equivalent definition
of conditional expectation is

\[ \mathbb{E}(f | \mathcal{B})(x) := \mathbb{E}(f(y) | y \in \mathcal{B}(x)) \]

for all $x \in X$, where $\mathcal{B}(x)$ is the unique atom in $\mathcal{B}$ which contains $x$. It is clear
that conditional expectation is a linear self-adjoint orthogonal projection on $L^2(X)$,
and preserves non-negativity, expectation, and constant functions. In particular it maps
bounded functions to bounded functions. If $\mathbb{E}(f | \mathcal{B})$ is zero we say that $f$ is orthogonal
to $\mathcal{B}$.

If $\mathcal{B}$, $\mathcal{B}'$ are two $\sigma$-algebras, we use $\mathcal{B} \vee \mathcal{B}'$ to denote the $\sigma$-algebra generated by $\mathcal{B}$
and $\mathcal{B}'$ (i.e. the $\sigma$-algebra whose atoms are the non-empty intersections of atoms in $\mathcal{B}$ with atoms
in $\mathcal{B}'$).
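
Since a $\sigma$-algebra on a finite set is the same thing as a partition into atoms, conditional expectation is just averaging over atoms. The Python sketch below illustrates this, together with the join of two partitions; the representation (a dict for $f$, a list of sets for the atoms) and the function names are assumptions made for the example.

```python
from fractions import Fraction

def conditional_expectation(f, atoms):
    """E(f | B): replace f on each atom of B by its average over that atom.

    f is a dict x -> value on X; atoms is a list of disjoint sets covering X.
    """
    result = {}
    for atom in atoms:
        average = sum(Fraction(f[x]) for x in atom) / len(atom)
        for x in atom:
            result[x] = average
    return result

def join(atoms1, atoms2):
    """Atoms of the sigma-algebra generated by two partitions:
    the non-empty pairwise intersections of their atoms."""
    return [a & b for a in atoms1 for b in atoms2 if a & b]

X = range(6)
f = {x: x * x for x in X}
B1 = [{0, 1, 2}, {3, 4, 5}]
B2 = [{0, 3}, {1, 4}, {2, 5}]
Ef = conditional_expectation(f, B1)
finest = join(B1, B2)
```

Applying `conditional_expectation` twice with the same atoms changes nothing (it is a projection), and the total of the values is preserved, matching the properties listed in the definition.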

Proof. [Density increment proof of Roth's theorem] We now give what is essentially
Roth's original argument, though not using Roth's original language (in particular, we
give the $\sigma$-algebras of Bohr sets significantly more prominence in the argument).

It is more convenient to work with the second formulation of Roth's theorem. Let
$\delta > 0$, and let $N$ be a sufficiently large number depending on $\delta$. Let $P_0$ be a progression
of length $N$, and let $A$ be a subset of $P_0$ of density at least $\delta$. Our task is to prove that
$A$ contains at least one arithmetic progression of length three.

Without loss of generality we may take $P_0 = [1, N]$. Set $\delta_0 := \mathbb{P}(n \in A | n \in [1, N])$,
thus $\delta \leq \delta_0 \leq 1$.

Choose a prime $p$ between $2N$ and $4N$. We embed $[1, N]$ into $\mathbb{Z}/p\mathbb{Z}$ in the obvious
manner, thus identifying $A$ with a subset of $\mathbb{Z}/p\mathbb{Z}$, of density at least $\delta/4$. Let
$f : \mathbb{Z}/p\mathbb{Z} \to \mathbb{R}$ be defined by setting $f(x) := 1_A(x)$ when $x \in [1, N]$ and $f(x) := \delta_0$
otherwise; observe that $\mathbb{E}(f) = \delta_0$ by construction. We then split $f = g + b$, where

\[ g := \mathbb{E}(f) = \delta_0 \quad \text{and} \quad b := f - \mathbb{E}(f) = 1_A - \delta_0 1_{[1,N]}. \]

There are two cases, depending on whether $b$ is linearly uniform or not. Suppose first
that $b$ is linearly uniform in the sense that $\|\hat b\|_{l^\infty(\mathbb{Z}/p\mathbb{Z})} \leq c\delta^2$ for some small absolute
constant $0 < c < 1$; this is the "easy case". Since $\Lambda_3(g, g, g) = \mathbb{E}(f)^3 = \delta_0^3 \geq \delta^3$, we
see from (2.10) that $\Lambda_3(f, f, f) \geq c'\delta^3$ for some absolute constant $c' > 0$ (if $c$ is chosen
sufficiently small). By definition of $f$ and $\Lambda_3$, this means that

\[ \mathbb{P}(n, n+r, n+2r \in A | (n, r) \in \mathbb{Z}/p\mathbb{Z} \times \mathbb{Z}/p\mathbb{Z}) \geq c'\delta^3. \]

The contribution of the $r = 0$ case is at most $O(1/p) = O(1/N)$. Thus if $N$ is large
enough, we see that there exists at least one pair $(n, r) \in \mathbb{Z}/p\mathbb{Z} \times \mathbb{Z}/p\mathbb{Z}$ with $r \neq 0$
such that $n, n+r, n+2r \in A$. Since $A \subset [1, N]$, this forces $n \in [1, N]$ and $1 \leq |r| \leq N$.
Since $p > 2N$, this implies that $A$ (thought of now as a subset of $\mathbb{Z}$ rather than $\mathbb{Z}/p\mathbb{Z}$)
also contains a non-trivial arithmetic progression $n, n+r, n+2r$, as claimed.

Now suppose we are in the "hard case" where $b$ is not linearly uniform. Then there
exists a frequency $\xi \in \mathbb{Z}/p\mathbb{Z}$ such that $|\hat b(\xi)| \geq c\delta^2$. By definition of $b$ and the Fourier
transform, we thus have

\[ |\mathbb{E}((1_A(n) - \delta_0 1_{[1,N]}(n)) e_p(-n\xi) | n \in \mathbb{Z}/p\mathbb{Z})| \geq c\delta^2. \]

Transferring this back from $\mathbb{Z}/p\mathbb{Z}$ to $[1, N]$, we obtain

\[ |\mathbb{E}((1_A(n) - \delta_0) e_p(-n\xi) | n \in [1, N])| \geq c\delta^2 \]
(with a slightly different constant $c$). If we let $\chi : [1, N] \to \mathbb{C}$ be the linear phase
function $\chi(n) := e_p(n\xi)$, we see that $1_A - \delta_0$ thus has some correlation with $\chi$:

\[ |\langle 1_A - \delta_0, \chi \rangle_{L^2([1,N])}| \geq c\delta^2. \tag{2.12} \]

Now let $0 < \varepsilon < 1$ be a small quantity depending on $\delta$ to be chosen later. We
partition the complex plane $\mathbb{C} = \bigcup_{Q \in \mathcal{Q}_\varepsilon} Q$ into squares of side-length $\varepsilon$ in the standard
manner (i.e. the corners of the squares lie in the lattice $\varepsilon\mathbb{Z}^2$), and let $\mathcal{B}_{\varepsilon,\chi}$ be the
$\sigma$-algebra on $[1, N]$ generated by the atoms $\{\chi^{-1}(Q) : Q \in \mathcal{Q}_\varepsilon\}$; sets of this type are also
known as Bohr sets. Observe that there are only $O(1/\varepsilon)$ non-empty atoms. Then on
each atom, $\chi$ can only vary by at most $O(\varepsilon)$, and thus we have the pointwise estimate

\[ \chi - \mathbb{E}(\chi | \mathcal{B}_{\varepsilon,\chi}) = O(\varepsilon). \]
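
The construction of the Bohr atoms is easy to simulate. The sketch below groups $[1, N]$ according to which $\varepsilon$-square contains $\chi(n)$ and measures how far $\chi$ is from its conditional expectation; the numerical values of `N`, `p`, `xi`, `eps` and the function name `bohr_atoms` are illustrative assumptions.

```python
import cmath
import math
from collections import defaultdict

def bohr_atoms(N, p, xi, eps):
    """Group n in [1, N] by the eps-square containing chi(n) = e_p(n xi)."""
    atoms = defaultdict(list)
    for n in range(1, N + 1):
        z = cmath.exp(2j * cmath.pi * n * xi / p)
        atoms[(math.floor(z.real / eps), math.floor(z.imag / eps))].append(n)
    return atoms

N, p, xi, eps = 200, 401, 37, 0.1        # p is a prime with 2N < p < 4N
atoms = bohr_atoms(N, p, xi, eps)
chi = {n: cmath.exp(2j * cmath.pi * n * xi / p) for n in range(1, N + 1)}

# sup-norm distance between chi and its average over the atom containing each n
max_dev = max(abs(chi[n] - sum(chi[m] for m in atom) / len(atom))
              for atom in atoms.values() for n in atom)
```

Since each atom maps into a square of diameter $\sqrt{2}\,\varepsilon$, the deviation `max_dev` is at most $\sqrt{2}\,\varepsilon$, and because the unit circle meets only $O(1/\varepsilon)$ squares, the number of non-empty atoms is of that order as well.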

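Since $1_A - \delta_0$ is bounded, we thus see from (2.12) and the triangle inequality that

\[ |\langle 1_A - \delta_0, \mathbb{E}(\chi | \mathcal{B}_{\varepsilon,\chi}) \rangle_{L^2([1,N])}| \geq c\delta^2 - O(\varepsilon). \]

Since conditional expectation is self-adjoint, we have

\[ \langle 1_A - \delta_0, \mathbb{E}(\chi | \mathcal{B}_{\varepsilon,\chi}) \rangle_{L^2([1,N])} = \langle \mathbb{E}(1_A - \delta_0 | \mathcal{B}_{\varepsilon,\chi}), \chi \rangle_{L^2([1,N])}, \]

and thus by boundedness of $\chi$

\[ \|\mathbb{E}(1_A - \delta_0 | \mathcal{B}_{\varepsilon,\chi})\|_{L^1([1,N])} \geq c\delta^2 - O(\varepsilon). \]

If we choose $\varepsilon := c'\delta^2$ for some suitably small absolute constant $0 < c' < 1$, the right-hand
side is at least $c\delta^2/2$. Now observe that $\mathbb{E}(1_A - \delta_0 | \mathcal{B}_{\varepsilon,\chi})$ has mean zero:

\[ \mathbb{E}(\mathbb{E}(1_A - \delta_0 | \mathcal{B}_{\varepsilon,\chi})) = \mathbb{E}(1_A - \delta_0) = \mathbb{E}(1_A) - \delta_0 = \delta_0 - \delta_0 = 0. \]

Thus we see that the positive part of $\mathbb{E}(1_A - \delta_0 | \mathcal{B}_{\varepsilon,\chi})$ is large:

\[ \mathbb{E}(\mathbb{E}(1_A - \delta_0 | \mathcal{B}_{\varepsilon,\chi})_+) \geq c\delta^2/4. \]

Now recall that $\mathcal{B}_{\varepsilon,\chi}$ is generated by $O(1/\varepsilon) = O(\delta^{-2})$ non-empty atoms. By definition
of conditional expectation and the pigeonhole principle, we can thus find some atom
$\chi^{-1}(Q)$ of $\mathcal{B}_{\varepsilon,\chi}$ of density at least $c''\delta^4$ such that $1_A - \delta_0$ is biased on this atom:

\[ \mathbb{E}(1_A(n) - \delta_0 | n \in \chi^{-1}(Q)) \geq c''\delta^2, \]

and thus

\[ \mathbb{P}(n \in A | n \in \chi^{-1}(Q)) \geq \delta_0 + c''\delta^2. \tag{2.13} \]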
This is a density increment; $A$ is denser on $\chi^{-1}(Q)$ than it is on $[1, N]$. However, $\chi^{-1}(Q)$
is a Bohr set rather than an arithmetic progression. Fortunately, the Bohr set is in some
sense "very close" to an arithmetic progression, in that it can be covered quite
efficiently by somewhat long arithmetic progressions$^{5}$. This can be seen as follows. By
the pigeonhole principle, one can find an integer $1 \leq q \leq \sqrt{N}$ such that

\[ \|q\xi/p\| \leq 1/\sqrt{N}, \]

where $\|x\|$ denotes the distance of $x$ to the nearest integer. From this one easily observes
that if $n \in \chi^{-1}(Q)$, then there is an arithmetic progression containing $n$ of spacing $q$ and
length comparable to $\varepsilon\sqrt{N}$ which is completely contained in $\chi^{-1}(Q)$. In particular, one
can partition $\chi^{-1}(Q)$ into disjoint arithmetic progressions, each of length comparable
to $\varepsilon\sqrt{N} \geq C^{-1}\delta^2\sqrt{N}$. From (2.13) and the pigeonhole principle, we thus see that at
least one of these progressions $P_1$ has large density:

\[ \mathbb{P}(n \in A | n \in P_1) \geq \delta_0 + c'''\delta^2. \]

5 This step is not particularly efficient when it comes to quantitative constants. A more refined
argument of Bourgain [5] works entirely with Bohr sets rather than arithmetic progressions, and obtains
the best bounds on $N_0(\delta)$ to date (namely $N_0(\delta) \leq C\delta^{-C/\delta^2}$).
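
The pigeonhole step that produces the spacing $q$ is Dirichlet's approximation argument, and for concrete numbers $q$ can simply be found by search. The sketch below does this in exact integer arithmetic; the values of `N`, `p`, `xi` and the function name `dirichlet_q` are illustrative assumptions.

```python
import math

def dirichlet_q(xi, p, N):
    """Return q with 1 <= q <= sqrt(N) minimising the distance from q*xi/p
    to the nearest integer."""
    Q = math.isqrt(N)

    def scaled_distance(q):
        m = (q * xi) % p
        return min(m, p - m)              # equals p * ||q xi / p||

    return min(range(1, Q + 1), key=scaled_distance)

N, p, xi = 400, 811, 123                  # p is a prime with 2N < p < 4N
q = dirichlet_q(xi, p, N)
m = (q * xi) % p
scaled = min(m, p - m)                    # p * ||q xi / p||
```

Dirichlet's theorem guarantees `scaled * isqrt(N) <= p`, i.e. $\|q\xi/p\| \leq 1/\sqrt{N}$, so moving $\varepsilon\sqrt{N}$ steps of size $q$ changes the phase of $\chi$ by only $O(\varepsilon)$.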

To summarize, we had started with a subset $A$ of a progression $P_0$ of length $N$
which had density $\delta_0$, and concluded that either $A$ contained an arithmetic progression,
or there was a sub-progression $P_1$ of length at least $C^{-1}\delta^2\sqrt{N}$ where $A$ has density
$\delta_1 \geq \delta_0 + c'''\delta^2$ for some absolute constant $c''' > 0$. We can then pass to this progression
$P_1$ and repeat the argument (note that we can make $C^{-1}\delta^2\sqrt{N}$ as large as we please by
requiring $N$ to be sufficiently large). The density can only increase by $c'''\delta^2$ at most
$O(1/\delta^2)$ times$^{6}$, and so this argument must eventually yield a non-trivial arithmetic progression of
length three in $A$. □
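
The termination of the iteration is a purely numerical matter: a density cannot exceed 1, so only boundedly many increments can occur. The following sketch counts the maximal number of increments under two hypothetical increment rules, the fixed increment $c\delta^2$ used in the proof and a refined increment proportional to the square of the current density; the constant `c = 0.5` and the function name are assumptions made for illustration.

```python
def steps_until_termination(delta, c, refined=False):
    """Count how many density increments fit before the density would exceed 1.

    Fixed rule:   d -> d + c * delta^2   (delta is the initial density)
    Refined rule: d -> d + c * d^2       (increment uses the current density)
    """
    d, steps = delta, 0
    while True:
        d_next = d + c * (d * d if refined else delta * delta)
        if d_next > 1:
            return steps
        d, steps = d_next, steps + 1

delta, c = 0.05, 0.5
fixed_steps = steps_until_termination(delta, c)
refined_steps = steps_until_termination(delta, c, refined=True)
```

The fixed rule allows at most $1/(c\delta^2)$ steps, while under the refined rule the density doubles after roughly $1/(c d)$ steps at density $d$, which sums to $O(1/\delta)$ steps overall.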

Proof. [Energy increment proof of Roth's theorem] We now give an energy increment
proof of Roth's theorem, inspired by arguments of Furstenberg [10], Bourgain [1], and
Green [20], as well as later arguments by Green and the author. This is not
the shortest such proof, nor the most efficient as far as explicit bounds are concerned,
but it is a proof which has a relatively small reliance on Fourier analysis and thus which
generalizes fairly easily to general $k$. The structure of this argument, and the concepts
introduced, are particularly crucial when establishing long arithmetic progressions in
the primes.

We shall use the third formulation of Roth's theorem; unlike the preceding proof, we
will not oscillate back and forth between progressions and cyclic groups, but remain in
a fixed cyclic group $\mathbb{Z}/N\mathbb{Z}$ throughout. Thus, we let $N$ be a large prime, and let $f$ be a
bounded non-negative function on $\mathbb{Z}/N\mathbb{Z}$ obeying the bound (2.1). Our task is to prove
(2.2).

We need some additional notation.

Definition 2.6 (Almost periodic functions). A linear phase function is a function
$\chi : \mathbb{Z}/N\mathbb{Z} \to \mathbb{C}$ of the form $\chi(n) = e_N(n\xi)$ for some $\xi \in \mathbb{Z}/N\mathbb{Z}$, which we refer to as the
frequency of $\chi$. If $K > 0$, then a $K$-quasiperiodic function is a function $f$ of the form
$\sum_{j=1}^{K} c_j \chi_j$, where each $\chi_j$ is a linear phase function (not necessarily distinct), and the $c_j$
are scalars such that $|c_j| \leq 1$. If $\sigma > 0$, then a $(\sigma, K)$-almost periodic function is a
function $f$ such that $\|f - f_{QP}\|_{L^2(\mathbb{Z}/N\mathbb{Z})} \leq \sigma$ for some $K$-quasiperiodic function $f_{QP}$.

Observe that if $f$ and $g$ are $(\sigma, K)$-almost periodic functions, then $fg$ is a $(2\sigma, K^2)$-almost
periodic function (taking $(fg)_{QP} := f_{QP} g_{QP}$).
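
The quasiperiodic part of this observation is elementary: multiplying two sums of $K$ linear phases gives a sum of $K^2$ linear phases, with frequencies added and coefficients multiplied. A minimal Python sketch follows, in which a quasiperiodic function is represented as a list of `(coefficient, frequency)` pairs; this representation and the sample data are assumptions made for the example.

```python
import cmath

def evaluate(terms, n, N):
    """Evaluate sum_j c_j e_N(n xi_j) for terms = [(c_j, xi_j), ...]."""
    return sum(c * cmath.exp(2j * cmath.pi * n * xi / N) for c, xi in terms)

def product(terms_f, terms_g, N):
    """Terms of f_QP * g_QP: coefficients multiply, frequencies add modulo N."""
    return [(c * d, (xi + eta) % N) for c, xi in terms_f for d, eta in terms_g]

N = 13
f_qp = [(1.0, 2), (-0.5, 5)]              # K = 2 terms, each |c_j| <= 1
g_qp = [(0.25j, 1), (1.0, 12)]
fg_qp = product(f_qp, g_qp, N)
max_error = max(abs(evaluate(fg_qp, n, N)
                    - evaluate(f_qp, n, N) * evaluate(g_qp, n, N))
                for n in range(N))
```

The product has $K^2 = 4$ terms whose coefficients still have modulus at most 1, and it agrees pointwise with the product of the two original functions up to rounding.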

A key property of almost periodic functions is that one can obtain non-trivial lower 
bounds on the A 3 quantity: 

Lemma 2.7 (Almost periodic functions are recurrent). Let $0 < \delta < 1$ and $0 < \sigma \leq \delta^3/100$,
and let $f$ be a bounded non-negative $(\sigma, K)$-almost periodic function obeying (2.1).
Then we have

\[ \Lambda_3(f, f, f) \geq c(K, \delta) - o_{K,\delta}(1) \]

for some $c(K, \delta) > 0$ (the key point here being that this quantity is independent of
$N$).

6 One can improve this to $O(1/\delta)$ by observing that the density increment of $c'''\delta^2$ can be refined to
$c'''\delta_0^2$.
Proof. Let $f_{QP} = \sum_{j=1}^K c_j \chi_j$ be the $K$-quasiperiodic function approximating $f$, and
let $\varepsilon > 0$ be a small number (depending on $K$, $\delta$) to be chosen later. Let $\xi_1, \dots, \xi_K$
be the frequencies associated to the characters $\chi_1, \dots, \chi_K$. By Dirichlet's simultaneous
approximation theorem (or the pigeonhole principle), we have

\[ \mathbf{P}(\|r\xi_j\| \le \varepsilon \text{ for all } 1 \le j \le K \mid r \in \mathbf{Z}/N\mathbf{Z}) \ge c(\varepsilon, K) \tag{2.14} \]

for some $c(\varepsilon, K) > 0$ independent of $N$. Next, observe from the triangle inequality that
if $r$ is as above, then

\[ \|T^r f_{QP} - f_{QP}\|_{L^2(\mathbf{Z}/N\mathbf{Z})} \le C(K)\varepsilon \]

where $T^r$ is the shift map $T^r f(x) := f(x + r)$. From this and the triangle inequality,
we conclude

\[ \|T^r f - f\|_{L^2(\mathbf{Z}/N\mathbf{Z})} \le \delta^3/10 + C(K)\varepsilon, \]

and by another application of $T^r$, we have

\[ \|T^{2r} f - T^r f\|_{L^2(\mathbf{Z}/N\mathbf{Z})} \le \delta^3/10 + C(K)\varepsilon. \]

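From this and the boundedness of $f$, we conclude that

\[ \|f \, T^r f \, T^{2r} f - f^3\|_{L^1(\mathbf{Z}/N\mathbf{Z})} \le \delta^3/2 + C(K)\varepsilon, \]

but from the bounded non-negativity of $f$, (2.1), and Holder's inequality we have

\[ \|f^3\|_{L^1(\mathbf{Z}/N\mathbf{Z})} \ge \|f\|_{L^1(\mathbf{Z}/N\mathbf{Z})}^3 \ge \delta^3 \]

and hence (by positivity of $f$)

\[ \mathbf{E}(f \, T^r f \, T^{2r} f(n) \mid n \in \mathbf{Z}/N\mathbf{Z}) \ge \delta^3/2 - C(K)\varepsilon. \]

If we choose $\varepsilon$ small enough depending on $\delta$ and $K$, we thus have

\[ \mathbf{E}(f \, T^r f \, T^{2r} f(n) \mid n \in \mathbf{Z}/N\mathbf{Z}) \ge \delta^3/4. \]

Averaging over all $r$, using (2.14) and the non-negativity of $f$, we obtain

\[ \mathbf{E}(f \, T^r f \, T^{2r} f(n) \mid n, r \in \mathbf{Z}/N\mathbf{Z}) \ge \delta^3 c(\varepsilon, K)/4. \]

But the left-hand side is nothing more than $\Lambda_3(f, f, f)$. The claim follows. □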
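The recurrence step (2.14) and the shift estimate that follows it can be tested on a small example. The Python sketch below is only an illustration under stated assumptions: we read $\|r\xi\|$ as the distance from $r\xi/N$ to the nearest integer, and the names `recurrent_shifts` and `shift_gap`, the modulus 101, the tolerance 0.25 and the sample frequencies are hypothetical choices of ours. The pigeonhole count used is the standard one: cutting the torus into $m^K$ boxes of side $1/m \le \varepsilon$, some box contains at least $N/m^K$ of the points, and their differences are recurrent shifts.

```python
import cmath
import math

def dist_to_nearest_integer(t):
    """Distance from the real number t to the nearest integer."""
    return abs(t - round(t))

def recurrent_shifts(N, freqs, eps):
    """All r in Z/NZ with ||r*xi/N|| <= eps for every frequency xi."""
    return [r for r in range(N)
            if all(dist_to_nearest_integer(r * xi / N) <= eps for xi in freqs)]

def shift_gap(terms, r, N):
    """Normalized L^2 norm of T^r f - f for f(n) = sum_j c_j e_N(xi_j n)."""
    def f(n):
        return sum(c * cmath.exp(2j * cmath.pi * xi * n / N) for c, xi in terms)
    return math.sqrt(sum(abs(f(n + r) - f(n)) ** 2 for n in range(N)) / N)

N, eps = 101, 0.25                      # illustrative parameters (assumptions)
terms = [(1.0, 3), (-0.5, 17)]          # a 2-quasiperiodic function
freqs = [xi for _, xi in terms]

good = recurrent_shifts(N, freqs, eps)

# Pigeonhole lower bound on the number of recurrent shifts.
m = math.ceil(1 / eps)
pigeonhole_count = math.ceil(N / m ** len(freqs))

# For recurrent r, |e_N(r*xi) - 1| <= 2*pi*eps, so the shift moves f very little.
worst_gap = max(shift_gap(terms, r, N) for r in good)
gap_bound = 2 * math.pi * eps * sum(abs(c) for c, _ in terms)
print(len(good), pigeonhole_count, worst_gap, gap_bound)
```

The bound `gap_bound` plays the role of $C(K)\varepsilon$ in the proof; the point is that neither it nor the proportion of recurrent shifts degrades as $N$ grows.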
To exploit the above result we shall need to approximate a general function $f$ by an
almost periodic function, plus a linearly uniform error. The first step in this strategy
shall be to construct $\sigma$-algebras such that the measurable functions in this algebra are
all almost periodic.

Lemma 2.8. Let $0 < \varepsilon \ll 1$ and let $\chi$ be a linear phase function. Then there exists a
$\sigma$-algebra $\mathcal{B}_{\varepsilon,\chi}$ such that $\|\chi - \mathbf{E}(\chi|\mathcal{B}_{\varepsilon,\chi})\|_{L^\infty} \le C\varepsilon$, and such that for every $\sigma > 0$, there
exists $K = K(\sigma, \varepsilon)$ such that every function $f$ which is measurable with respect to $\mathcal{B}_{\varepsilon,\chi}$
and obeys the bound $\|f\|_{L^\infty(\mathbf{Z}/N\mathbf{Z})} \le 1$ is $(\sigma, K)$-almost periodic.

Proof. We use a random construction, constructing a $\sigma$-algebra which has the stated
properties with non-zero probability. Let $\alpha$ be a randomly selected element of the unit
square in the complex plane, and let $\mathcal{B}_{\varepsilon,\chi}$ be the $\sigma$-algebra with atoms of the form
$\{\chi^{-1}(Q) : Q \in \mathcal{Q}_\varepsilon + \varepsilon\alpha\}$. Then as in the previous proof of Roth's theorem, we have
$\|\chi - \mathbf{E}(\chi|\mathcal{B}_{\varepsilon,\chi})\|_{L^\infty} \le C\varepsilon$. Now we prove the approximation claim. It suffices to verify
the claim for $\sigma = 2^{-n}$ for some integer $n \gg 1$, with probability $1 - O(\sigma)$. Also, since $\mathcal{B}_{\varepsilon,\chi}$
has at most $C(\varepsilon)$ atoms, it suffices to verify the claim when $f$ is the indicator function
of one of these atoms $A$, with probability $1 - O(C(\varepsilon)^{-1}\sigma)$.

The function $f$ can be rewritten as $f(x) = 1_Q(\chi(x) - \varepsilon\alpha)$. We can use the Weierstrass
approximation theorem to approximate $1_Q(z)$ on the disk $z = O(1/\varepsilon)$ by a polynomial
$P(z, \overline{z})$ involving at most $C(\sigma, \varepsilon)$ terms and with coefficients bounded by $C(\sigma, \varepsilon)$ such
that $|P|$ is bounded by 1 in this disk, and $1_Q(z) - P(z, \overline{z}) = O(C^{-1}\sigma)$ for all $z$ in this
disk, except for a set of measure $O(C(\varepsilon)^{-2}\sigma^2)$. A standard randomization argument
then allows us to assert that

\[ \|1_Q(\chi(x) - \varepsilon\alpha) - P(\chi(x) - \varepsilon\alpha, \overline{\chi(x) - \varepsilon\alpha})\|_{L^2(\mathbf{Z}/N\mathbf{Z})} \le \sigma \]

with probability $1 - O(C(\varepsilon)^{-1}\sigma)$. But $P(\chi(x) - \varepsilon\alpha, \overline{\chi(x) - \varepsilon\alpha})$ can be written as a linear
combination of at most $C(\varepsilon, \sigma)$ characters, with coefficients at most $C(\varepsilon, \sigma)$, and is thus
$C(\varepsilon, \sigma)$-quasiperiodic (one can reduce the coefficients to be less than 1 by repeating
characters as necessary). The claim follows. □
One can concatenate these $\sigma$-algebras together. If $\mathcal{B}_1, \dots, \mathcal{B}_n$ are $\sigma$-algebras, we let
$\mathcal{B}_1 \vee \dots \vee \mathcal{B}_n$ be the smallest $\sigma$-algebra which contains all of them.

Corollary 2.9. Let $0 < \varepsilon_1, \dots, \varepsilon_n \ll 1$ and let $\chi_1, \dots, \chi_n$ be linear phases. Let
$\mathcal{B}_{\varepsilon_1,\chi_1}, \dots, \mathcal{B}_{\varepsilon_n,\chi_n}$ be the $\sigma$-algebras arising from the above lemma. Then for every
$\sigma > 0$, there exists $K = K(n, \sigma, \varepsilon_1, \dots, \varepsilon_n)$ such that every function $f$ which is
measurable with respect to $\mathcal{B}_{\varepsilon_1,\chi_1} \vee \dots \vee \mathcal{B}_{\varepsilon_n,\chi_n}$ and obeys the bound $\|f\|_{L^\infty(\mathbf{Z}/N\mathbf{Z})} \le 1$ is
$(\sigma, K)$-almost periodic.

Proof. Since the number of atoms in this $\sigma$-algebra is at most $C(n, \varepsilon_1, \dots, \varepsilon_n)$, it
suffices to verify this when $f$ is the indicator function of a single atom. But then $f$
is the product of $n$ indicator functions of atoms in $\mathcal{B}_{\varepsilon_1,\chi_1}, \dots, \mathcal{B}_{\varepsilon_n,\chi_n}$, and the claim
follows from the preceding lemma and the previously made observation that the product
of almost periodic functions is almost periodic. □
The significance of these $\sigma$-algebras is not only that they contain functions which are
almost periodic and hence have non-trivial bounds on the $\Lambda_3$ form, but also that they
capture "obstructions to linear uniformity":

Lemma 2.10 (Non-uniformity implies structure). Let $b$ be a bounded function such that
$\|\hat b\|_{l^\infty} \ge \sigma > 0$, and let $0 < \varepsilon \ll \sigma$. Then there exists a linear phase function $\chi$ with
associated $\sigma$-algebra $\mathcal{B}_{\varepsilon,\chi}$ such that

\[ \|\mathbf{E}(b|\mathcal{B}_{\varepsilon,\chi})\|_{L^2(\mathbf{Z}/N\mathbf{Z})} \ge C^{-1}\sigma. \]

This is proven by a repetition of the arguments used in the first proof of Roth's 
theorem, and we leave it to the reader. 

We can now assemble all these ingredients together to prove Roth's theorem. The 
major step here is a structure theorem which decomposes an arbitrary function into an 
almost periodic piece and a linearly uniform piece. 

Proposition 2.11 (Quantitative Koopman-von Neumann theorem). Let $F : \mathbf{R}^+ \times \mathbf{R}^+ \to \mathbf{R}^+$
be an arbitrary function, let $0 < \delta \le 1$, and let $f$ be any bounded non-negative
function on $\mathbf{Z}/N\mathbf{Z}$ obeying (2.1). Let $\sigma := \delta^3/100$. Then there exists a quantity
$0 < K \le C(F, \delta)$ and a decomposition $f = g + b$, where $g$ is bounded, non-negative, has
mean $\mathbf{E}(g) = \mathbf{E}(f)$, and is $(\sigma, K)$-almost periodic, and $b$ obeys the bound

\[ \|\hat b\|_{l^\infty} \le F(\delta, K). \tag{2.15} \]

Proof. We apply the following energy incrementation algorithm to construct $g$ and $b$.
We shall need two auxiliary $\sigma$-algebras $\mathcal{B}$ and $\mathcal{B}'$, with $\mathcal{B}'$ always being larger than or
equal to $\mathcal{B}$. Also, $\mathcal{B}$ will always be of the form $\mathcal{B} = \mathcal{B}_{\varepsilon_1,\chi_1} \vee \dots \vee \mathcal{B}_{\varepsilon_n,\chi_n}$ for some $n$, some
$\varepsilon_1, \dots, \varepsilon_n > 0$, and some $\chi_1, \dots, \chi_n$, and similarly for $\mathcal{B}'$ (but with different values of
$n$); also we will have the bound

\[ \|\mathbf{E}(f|\mathcal{B}')\|_{L^2(\mathbf{Z}/N\mathbf{Z})}^2 \le \|\mathbf{E}(f|\mathcal{B})\|_{L^2(\mathbf{Z}/N\mathbf{Z})}^2 + \sigma^2/4 \tag{2.16} \]

or equivalently (by Pythagoras' theorem)

\[ \|\mathbf{E}(f|\mathcal{B}') - \mathbf{E}(f|\mathcal{B})\|_{L^2(\mathbf{Z}/N\mathbf{Z})} \le \sigma/2. \tag{2.17} \]

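• Step 0: Initialize $\mathcal{B} = \mathcal{B}' = \{\emptyset, \mathbf{Z}/N\mathbf{Z}\}$ to be the trivial $\sigma$-algebra. Note that
(2.16) is trivially true at present.

• Step 1: By construction, we have $\mathcal{B} = \mathcal{B}_{\varepsilon_1,\chi_1} \vee \dots \vee \mathcal{B}_{\varepsilon_n,\chi_n}$ for some $\varepsilon_1, \dots, \varepsilon_n > 0$
and linear phase functions $\chi_1, \dots, \chi_n$. The function $\mathbf{E}(f|\mathcal{B})$ is bounded and
measurable with respect to $\mathcal{B}$. By Corollary 2.9 we can thus find $K$ depending
on $\delta, n, \varepsilon_1, \dots, \varepsilon_n$ such that $\mathbf{E}(f|\mathcal{B})$ is $(\sigma/2, K)$-almost periodic.

• Step 2: Set $g := \mathbf{E}(f|\mathcal{B}')$ and $b := f - \mathbf{E}(f|\mathcal{B}')$. If $\|\hat b\|_{l^\infty} \le F(\delta, K)$ then we
terminate the algorithm; otherwise we move on to Step 3.

• Step 3: Since we have not terminated the algorithm, we have $\|\hat b\|_{l^\infty} > F(\delta, K)$.
Using Lemma 2.10 we can then find $\varepsilon = F(\delta, K)/C$ and a character $\chi$, with
associated $\sigma$-algebra $\mathcal{B}_{\varepsilon,\chi}$, such that

\[ \|\mathbf{E}(b|\mathcal{B}_{\varepsilon,\chi})\|_{L^2(\mathbf{Z}/N\mathbf{Z})} \ge C^{-1} F(\delta, K). \]

From the identity

\[ \mathbf{E}(b|\mathcal{B}_{\varepsilon,\chi}) = \mathbf{E}(\mathbf{E}(f|\mathcal{B}' \vee \mathcal{B}_{\varepsilon,\chi}) - \mathbf{E}(f|\mathcal{B}') \mid \mathcal{B}_{\varepsilon,\chi}) \]

and Pythagoras's theorem, we thus have

\[ \|\mathbf{E}(f|\mathcal{B}' \vee \mathcal{B}_{\varepsilon,\chi}) - \mathbf{E}(f|\mathcal{B}')\|_{L^2(\mathbf{Z}/N\mathbf{Z})} \ge C^{-1} F(\delta, K), \]

which by Pythagoras again implies the energy increment

\[ \|\mathbf{E}(f|\mathcal{B}' \vee \mathcal{B}_{\varepsilon,\chi})\|_{L^2(\mathbf{Z}/N\mathbf{Z})}^2 \ge \|\mathbf{E}(f|\mathcal{B}')\|_{L^2(\mathbf{Z}/N\mathbf{Z})}^2 + C^{-2} F(\delta, K)^2. \]

• Step 4: We now replace $\mathcal{B}'$ with $\mathcal{B}' \vee \mathcal{B}_{\varepsilon,\chi}$. If we continue to have the property
(2.16), then we return to Step 2. Otherwise, we replace $\mathcal{B}$ with $\mathcal{B}'$ and return to
Step 1.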

Let us first see why this algorithm terminates. If $\mathcal{B}$ (and hence $K$) is fixed, then
each time we pass through Step 4, the energy $\|\mathbf{E}(f|\mathcal{B}')\|_{L^2(\mathbf{Z}/N\mathbf{Z})}^2$ increases by at least
$C^{-2} F(\delta, K)^2$. Thus either we terminate the algorithm, or (2.16) must be violated, within
$C\sigma^2/F(\delta, K)^2 = C(F, \delta, K)$ steps. If the latter occurs, then $\mathcal{B}$ is replaced by a new
$\sigma$-algebra involving $C(F, \delta, K)$ new characters, with corresponding $\varepsilon$ parameters which are
bounded from below by $C(F, \delta, K)^{-1}$. This implies that the $K$ quantity associated to $\mathcal{B}$
will be replaced by a quantity of the form $C(F, \delta, K)$. Also, the energy $\|\mathbf{E}(f|\mathcal{B})\|_{L^2(\mathbf{Z}/N\mathbf{Z})}^2$
will have increased by at least $\sigma^2/4$, thanks to the violation of (2.16). On the other
hand, since $f$ was assumed bounded, this energy cannot exceed 1. Thus we can change
$\mathcal{B}$ at most $O(\sigma^{-2})$ times. Putting all this together we see that the entire algorithm must
terminate in $C(F, \delta)$ steps, and the quantity $K$ will also not exceed $C(F, \delta)$. (Note
that these constants can be extremely large, as they will involve iterating $F$ repeatedly;
however, the key point is that they do not depend on $N$).

The claims of the proposition now follow from construction. Note that $\mathbf{E}(f|\mathcal{B})$ is
$(\sigma/2, K)$-almost periodic by construction, and hence $g = \mathbf{E}(f|\mathcal{B}')$ will be $(\sigma, K)$-almost
periodic thanks to (2.17). □
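The passage between (2.16) and (2.17), and the energy increment in Step 3, both rest on Pythagoras' theorem for nested $\sigma$-algebras: if $\mathcal{B}'$ refines $\mathcal{B}$, then the energy gained in passing from $\mathbf{E}(f|\mathcal{B})$ to $\mathbf{E}(f|\mathcal{B}')$ equals the squared $L^2$ distance between them. The following Python sketch illustrates this on a toy example; it is our own illustration, with $\sigma$-algebras modelled as partitions of $\{0, \dots, N-1\}$, and the names `cond_exp` and `energy` as well as the sample data are hypothetical.

```python
def cond_exp(f, atoms):
    """E(f|B), where B is generated by the partition `atoms` of range(len(f))."""
    out = [0.0] * len(f)
    for atom in atoms:
        mean = sum(f[n] for n in atom) / len(atom)
        for n in atom:
            out[n] = mean
    return out

def energy(g):
    """Squared normalized L^2 norm E(|g|^2)."""
    return sum(v * v for v in g) / len(g)

# Toy data (assumption): a bounded non-negative function on Z/12Z.
f = [1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1]
coarse = [range(0, 6), range(6, 12)]                           # plays the role of B
fine = [range(0, 3), range(3, 6), range(6, 9), range(9, 12)]   # B', refining B

g_coarse = cond_exp(f, coarse)
g_fine = cond_exp(f, fine)

# Pythagoras: energy increment equals squared distance between the projections.
increment = energy(g_fine) - energy(g_coarse)
gap = energy([a - b for a, b in zip(g_fine, g_coarse)])
print(increment, gap)
```

Since the energy of a conditional expectation of a function bounded by 1 never exceeds 1, increments of a fixed size can only occur boundedly often, which is the mechanism behind the termination argument.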

We can now finally prove Roth's theorem. We let $F : \mathbf{R}^+ \times \mathbf{R}^+ \to \mathbf{R}^+$ be a function
to be chosen later, and apply the above Proposition to decompose $f = g + b$. By Lemma
2.7 we have

\[ \Lambda_3(g, g, g) \ge c(K, \delta) - o_{N;\delta}(1) \]

and then by (2.10) and (2.15) we have

\[ \Lambda_3(f, f, f) \ge c(K, \delta) + O(F(\delta, K)) - o_{N;\delta}(1). \]

By choosing $F$ sufficiently small, we can absorb the second term in the first, thus

\[ \Lambda_3(f, f, f) \ge c(K, \delta)/2 - o_{N;\delta}(1). \]

Since $K \le C(F, \delta) = C(\delta)$, the claim (2.3) now follows. □
We remark that there are several other proofs of Roth's theorem in the literature, 
notably Szemeredi's proof based on density increment arguments and extremely large 
cubes (see [19]), and an argument based on the Szemeredi regularity lemma (which in
turn requires energy increment arguments in the proof) in jSTj. While these arguments 
are also important to the theory and both have generalizations to higher k, we will not 
discuss them here due to lack of space. 



3. Interlude on multilinear operators 

We will shortly turn our attention to Szemeredi's theorem. Based on the preceding 
section, it is unsurprising that much of the analysis will revolve around the multilinear 
form 

\[ \Lambda_k(f_0, \dots, f_{k-1}) := \mathbf{E}\Bigl(\prod_{j=0}^{k-1} f_j(x + jr) \mid x, r \in \mathbf{Z}/N\mathbf{Z}\Bigr) \]

for a large prime $N$. It turns out that to analyze this multilinear form, it is convenient
to generalize substantially and consider multilinear expressions of the form

\[ \mathbf{E}\Bigl(K(x) \prod_{j=1}^d F_j(x) \mid x \in \prod_{j=1}^d A_j\Bigr) \tag{3.1} \]

where $d \ge 1$ is fixed, $A_1, \dots, A_d$ are finite non-empty sets, $K : \prod_{j=1}^d A_j \to \mathbf{C}$ is a
fixed kernel, $x = (x_1, \dots, x_d)$, and each $F_j : \prod_{j=1}^d A_j \to \mathbf{C}$ is a bounded function
which is independent of the $x_j$ co-ordinate (and thus only depends on the other $d - 1$
co-ordinates).

Henceforth we fix $d$ and $A_1, \dots, A_d$. Let $\{0,1\}^d$ be the discrete unit cube. We need
the following notation: if $x^{(0)} = (x_1^{(0)}, \dots, x_d^{(0)})$ and $x^{(1)} = (x_1^{(1)}, \dots, x_d^{(1)})$ are elements
of $\prod_{j=1}^d A_j$, and $\varepsilon = (\varepsilon_1, \dots, \varepsilon_d) \in \{0,1\}^d$, then we write $x^{(\varepsilon)} := (x_1^{(\varepsilon_1)}, \dots, x_d^{(\varepsilon_d)}) \in \prod_{j=1}^d A_j$,
and refer to the $2^d$-tuple $(x^{(\varepsilon)})_{\varepsilon \in \{0,1\}^d}$ of elements in $\prod_{j=1}^d A_j$ as the cube
generated by $x^{(0)}$ and $x^{(1)}$; this is a cube in the combinatorial sense rather than the
geometric sense. Thus for instance, when $d = 2$, the cube generated by $(x, y)$ and
$(x', y')$ is the 4-tuple consisting of $(x, y)$, $(x, y')$, $(x', y)$, and $(x', y')$.
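The cube notation is easy to make concrete. The following Python sketch is a minimal illustration of the definition above (the function name `cube` is our own, hypothetical choice): for each $\varepsilon \in \{0,1\}^d$ it picks the $j$-th co-ordinate from $x^{(0)}$ or $x^{(1)}$ according to $\varepsilon_j$.

```python
from itertools import product

def cube(x0, x1):
    """Map each eps in {0,1}^d to x^(eps), whose j-th entry is x0[j] if eps[j] == 0 else x1[j]."""
    d = len(x0)
    choices = (x0, x1)
    return {eps: tuple(choices[e][j] for j, e in enumerate(eps))
            for eps in product((0, 1), repeat=d)}

# The d = 2 example from the text.
c = cube(("x", "y"), ("x'", "y'"))
for eps, vertex in sorted(c.items()):
    print(eps, vertex)
```

For $d = 2$ this reproduces the four vertices listed in the text, and in general the cube has $2^d$ vertices (counted with multiplicity when $x^{(0)}$ and $x^{(1)}$ share co-ordinates).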
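Now suppose we have a $2^d$-tuple of kernels $K^{(\varepsilon)} : \prod_{j=1}^d A_j \to \mathbf{C}$ for each $\varepsilon \in \{0,1\}^d$.
We define the Gowers inner product $\langle (K^{(\varepsilon)})_{\varepsilon \in \{0,1\}^d} \rangle_{\Box^d}$ to be

\[ \langle (K^{(\varepsilon)})_{\varepsilon \in \{0,1\}^d} \rangle_{\Box^d} := \mathbf{E}\Bigl(\prod_{\varepsilon \in \{0,1\}^d} \mathcal{C}^{|\varepsilon|} K^{(\varepsilon)}(x^{(\varepsilon)}) \mid x^{(0)}, x^{(1)} \in \prod_{j=1}^d A_j\Bigr) \]

where $\mathcal{C} f := \overline{f}$ is the conjugation operator, and $|\varepsilon| := \sum_{j=1}^d \varepsilon_j$. By separating the $d$-th
co-ordinates of $x^{(0)}$ and $x^{(1)}$ from the first $d - 1$ co-ordinates $x'^{(0)}$ and $x'^{(1)}$, we observe the identity

\[ \langle (K^{(\varepsilon)})_{\varepsilon \in \{0,1\}^d} \rangle_{\Box^d} = \mathbf{E}\Bigl( \mathbf{E}\Bigl(\prod_{\varepsilon' \in \{0,1\}^{d-1}} \mathcal{C}^{|\varepsilon'|} K^{(\varepsilon',0)}(x'^{(\varepsilon')}, y) \mid y \in A_d\Bigr) \, \mathcal{C}\, \mathbf{E}\Bigl(\prod_{\varepsilon' \in \{0,1\}^{d-1}} \mathcal{C}^{|\varepsilon'|} K^{(\varepsilon',1)}(x'^{(\varepsilon')}, y) \mid y \in A_d\Bigr) \mid x'^{(0)}, x'^{(1)} \in \prod_{j=1}^{d-1} A_j \Bigr). \tag{3.2} \]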

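Applying Cauchy-Schwarz in the variables $x'^{(0)}, x'^{(1)}$ (the first $d - 1$ co-ordinates of $x^{(0)}, x^{(1)}$), we conclude that

\[ |\langle (K^{(\varepsilon)})_{\varepsilon \in \{0,1\}^d} \rangle_{\Box^d}| \le \langle (K^{(\varepsilon',0)})_{\varepsilon \in \{0,1\}^d} \rangle_{\Box^d}^{1/2} \, \langle (K^{(\varepsilon',1)})_{\varepsilon \in \{0,1\}^d} \rangle_{\Box^d}^{1/2} \]

where $\varepsilon' := (\varepsilon_1, \dots, \varepsilon_{d-1})$ are the first $d - 1$ co-ordinates of $\varepsilon$; note that (3.2) ensures
that the inner products appearing in the right-hand side of the above equation are non-
negative reals. Of course one has a similar inequality if we work with the $j$-th co-ordinate
instead of the $d$-th co-ordinate for any $1 \le j \le d$. Applying the above Cauchy-Schwarz
inequality once in each co-ordinate, we obtain the Gowers-Cauchy-Schwarz inequality

\[ |\langle (K^{(\varepsilon)})_{\varepsilon \in \{0,1\}^d} \rangle_{\Box^d}| \le \prod_{\varepsilon \in \{0,1\}^d} \|K^{(\varepsilon)}\|_{\Box^d} \tag{3.3} \]

where $\|K\|_{\Box^d}$ is the Gowers cube norm

\[ \|K\|_{\Box^d} := \langle (K)_{\varepsilon \in \{0,1\}^d} \rangle_{\Box^d}^{1/2^d}. \]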

Again, the identity (3.2) ensures that this norm is non-negative. Using the multilinearity
of the Gowers inner product, we then observe for an arbitrary pair $K_0, K_1$ of kernels
that

\[ \|K_0 + K_1\|_{\Box^d}^{2^d} = \langle (K_0 + K_1)_{\varepsilon \in \{0,1\}^d} \rangle_{\Box^d} = \sum_{A \subseteq \{0,1\}^d} \langle (K_{1_A(\varepsilon)})_{\varepsilon \in \{0,1\}^d} \rangle_{\Box^d} \le \sum_{A \subseteq \{0,1\}^d} \prod_{\varepsilon \in \{0,1\}^d} \|K_{1_A(\varepsilon)}\|_{\Box^d} = (\|K_0\|_{\Box^d} + \|K_1\|_{\Box^d})^{2^d} \]

which thus yields the Gowers triangle inequality

\[ \|K_0 + K_1\|_{\Box^d} \le \|K_0\|_{\Box^d} + \|K_1\|_{\Box^d}. \]

Since the Gowers cube norm is clearly homogeneous, we thus see that $\|\cdot\|_{\Box^d}$ is a semi-
norm. We will later show that it is in fact a norm when $d \ge 2$; when $d = 1$ we have
$\|K\|_{\Box^1} = |\mathbf{E}(K(x) \mid x \in A_1)|$ which is degenerate and thus not a genuine norm.
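For very small sets the Gowers inner product and cube norm can be computed by brute force straight from the definitions. The Python sketch below is only an illustration under our own conventions: kernels are stored as dictionaries indexed by tuples, and the names `gowers_inner_product` and `box_norm`, the random kernels and the set sizes are hypothetical. It checks the triangle inequality for $d = 2$ on one random pair, and exhibits the degeneracy of the $d = 1$ case.

```python
from itertools import product
import random

def gowers_inner_product(kernels, sets):
    """<(K^(eps))>_{box^d}; kernels maps eps in {0,1}^d to a dict over prod(sets)."""
    d = len(sets)
    points = list(product(*sets))
    corners = list(product((0, 1), repeat=d))
    total = 0j
    for x0 in points:
        for x1 in points:
            term = 1 + 0j
            for eps in corners:
                x_eps = tuple((x0, x1)[e][j] for j, e in enumerate(eps))
                value = complex(kernels[eps][x_eps])
                # C^{|eps|}: conjugate exactly when |eps| is odd
                term *= value.conjugate() if sum(eps) % 2 else value
            total += term
    return total / len(points) ** 2

def box_norm(K, sets):
    """Gowers cube norm: the 2^d-th root of the inner product of 2^d copies of K."""
    d = len(sets)
    same = {eps: K for eps in product((0, 1), repeat=d)}
    return max(gowers_inner_product(same, sets).real, 0.0) ** (1.0 / 2 ** d)

random.seed(0)
sets = [range(3), range(4)]             # d = 2, tiny sets (assumption)

def random_kernel():
    return {x: complex(random.uniform(-1, 1), random.uniform(-1, 1))
            for x in product(*sets)}

K0, K1 = random_kernel(), random_kernel()
K_sum = {x: K0[x] + K1[x] for x in K0}
lhs = box_norm(K_sum, sets)
rhs = box_norm(K0, sets) + box_norm(K1, sets)

# d = 1: a non-zero kernel of mean zero has vanishing cube norm.
K_deg = {(0,): 1.0, (1,): -1.0}
deg_norm = box_norm(K_deg, [range(2)])
print(lhs, rhs, deg_norm)
```

The cost grows like $|\prod_j A_j|^2 \cdot 2^d$, so this is only meant for sanity checks, not for computation at scale.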

The significance of the Gowers cube norm to expressions of the form (3.1) lies in the
following estimate (which is implicit in [S] and also in [T7]).

Lemma 3.1 (Van der Corput lemma). Let $d \ge 1$, let $A_1, \dots, A_d$ be finite non-empty
sets, let $K : \prod_{j=1}^d A_j \to \mathbf{C}$, and for each $1 \le j \le d$ let $F_j : \prod_{j=1}^d A_j \to \mathbf{C}$ be a bounded
function which is independent of the $x_j$ co-ordinate. Then we have

\[ \Bigl|\mathbf{E}\Bigl(K(x) \prod_{j=1}^d F_j(x) \mid x \in \prod_{j=1}^d A_j\Bigr)\Bigr| \le \|K\|_{\Box^d}. \]

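Proof. We induct on $d$. When $d = 1$, the claim becomes

\[ |\mathbf{E}(K(x_1) F_1(x_1) \mid x_1 \in A_1)| \le |\mathbf{E}(K(x_1) \mid x_1 \in A_1)|, \]

which follows since $F_1$ is independent of $x_1$ and is bounded.

Now suppose that $d \ge 2$ and the claim has already been proven for $d - 1$. Since
$F_d$ is independent of the $x_d$ co-ordinate, we may abuse notation and interpret $F_d$ as a
function on $\prod_{j=1}^{d-1} A_j$ rather than $\prod_{j=1}^d A_j$. Writing $x' := (x_1, \dots, x_{d-1})$, we then separate off the $x_d$ co-ordinate to
write

\[ \mathbf{E}\Bigl(K(x) \prod_{j=1}^d F_j(x) \mid x \in \prod_{j=1}^d A_j\Bigr) = \mathbf{E}\Bigl(F_d(x') \, \mathbf{E}\Bigl(K(x', x_d) \prod_{j=1}^{d-1} F_j(x', x_d) \mid x_d \in A_d\Bigr) \mid x' \in \prod_{j=1}^{d-1} A_j\Bigr). \]

Since $F_d$ is bounded, we may apply Cauchy-Schwarz in the $x'$ variable to then obtain

\[ \Bigl|\mathbf{E}\Bigl(K(x) \prod_{j=1}^d F_j(x) \mid x \in \prod_{j=1}^d A_j\Bigr)\Bigr| \le \mathbf{E}\Bigl(\Bigl|\mathbf{E}\Bigl(K(x', x_d) \prod_{j=1}^{d-1} F_j(x', x_d) \mid x_d \in A_d\Bigr)\Bigr|^2 \mid x' \in \prod_{j=1}^{d-1} A_j\Bigr)^{1/2} \]

\[ = \mathbf{E}\Bigl(\mathbf{E}\Bigl(K(x', x_d^{(0)}) \overline{K(x', x_d^{(1)})} \prod_{j=1}^{d-1} F_j(x', x_d^{(0)}) \overline{F_j(x', x_d^{(1)})} \mid x' \in \prod_{j=1}^{d-1} A_j\Bigr) \mid x_d^{(0)}, x_d^{(1)} \in A_d\Bigr)^{1/2}. \]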



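For each fixed $x_d^{(0)}, x_d^{(1)} \in A_d$ and each $1 \le j \le d - 1$, the function $F_j(x', x_d^{(0)}) \overline{F_j(x', x_d^{(1)})}$ is a
bounded function of $x' = (x_1, \dots, x_{d-1})$ which is independent of $x_j$. If we then apply the induction hypothesis we have

\[ \Bigl|\mathbf{E}\Bigl(K(x', x_d^{(0)}) \overline{K(x', x_d^{(1)})} \prod_{j=1}^{d-1} F_j(x', x_d^{(0)}) \overline{F_j(x', x_d^{(1)})} \mid x' \in \prod_{j=1}^{d-1} A_j\Bigr)\Bigr| \le \|K(\cdot, x_d^{(0)}) \overline{K(\cdot, x_d^{(1)})}\|_{\Box^{d-1}} \]

so by Holder's inequality

\[ \Bigl|\mathbf{E}\Bigl(K(x) \prod_{j=1}^d F_j(x) \mid x \in \prod_{j=1}^d A_j\Bigr)\Bigr| \le \mathbf{E}\Bigl(\|K(\cdot, x_d^{(0)}) \overline{K(\cdot, x_d^{(1)})}\|_{\Box^{d-1}}^{2^{d-1}} \mid x_d^{(0)}, x_d^{(1)} \in A_d\Bigr)^{1/2^d}. \]

But the right-hand side can be re-arranged to be precisely $\|K\|_{\Box^d}$, and the claim follows.
□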
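In the case $d = 2$ the van der Corput lemma says that $|\mathbf{E}(K(x,y) F_1(y) F_2(x))| \le \|K\|_{\Box^2}$ whenever $F_1, F_2$ are bounded by 1. The Python sketch below spot-checks this on random data; it is an illustration only, the names `box2_norm` and `random_bounded` and the set sizes are hypothetical, and the closed form used for the $\Box^2$ norm is the one obtained by first averaging over $y$ for each pair $(x, x')$.

```python
import cmath
import math
import random

def box2_norm(K, A1, A2):
    """||K||_{box^2}; uses E_{x,x'} |E_y K(x,y) conj(K(x',y))|^2 for the 4th power."""
    total = 0.0
    for x in A1:
        for xp in A1:
            inner = sum(K[x, y] * K[xp, y].conjugate() for y in A2) / len(A2)
            total += abs(inner) ** 2
    return (total / len(A1) ** 2) ** 0.25

def random_bounded():
    """A random complex number of modulus at most 1."""
    return cmath.rect(random.random(), random.uniform(0.0, 2.0 * math.pi))

random.seed(1)
A1, A2 = range(4), range(5)             # tiny sets (assumption)
worst_slack = float("inf")
for _ in range(20):
    K = {(x, y): random_bounded() for x in A1 for y in A2}
    F1 = {y: random_bounded() for y in A2}   # independent of x_1 = x
    F2 = {x: random_bounded() for x in A1}   # independent of x_2 = y
    average = sum(K[x, y] * F1[y] * F2[x] for x in A1 for y in A2) / (len(A1) * len(A2))
    worst_slack = min(worst_slack, box2_norm(K, A1, A2) - abs(average))
print(worst_slack)
```

A non-negative `worst_slack` (up to rounding) is what the lemma predicts; random kernels typically leave a lot of room, while rank-one kernels $K(x,y) = a(x)b(y)$ of unit modulus come close to equality.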

We can now show that $\|\cdot\|_{\Box^d}$ is a genuine norm when $d \ge 2$:

Corollary 3.2. If $d \ge 2$ and $\|K\|_{\Box^d} = 0$, then $K = 0$.

Proof. Let $(x_1, \dots, x_d) \in \prod_{j=1}^d A_j$ be arbitrary. We then define $f_i : \prod_{j=1}^d A_j \to \mathbf{C}$ by
defining $f_i(y_1, \dots, y_d) = 1$ when $y_j = x_j$ for all $j \ne i$, and $f_i(y_1, \dots, y_d) = 0$ otherwise.
Applying the previous lemma we thus see that $K(x_1, \dots, x_d) = 0$. Since $(x_1, \dots, x_d)$
was arbitrary, the claim follows. □

Let us informally call a kernel $K$ Gowers uniform if it has small $\Box^d$ norm. The
van der Corput lemma then asserts that Gowers uniform kernels are negligible for the
purpose of computing multilinear expressions such as (3.1). In particular, when $d = 2$,
the $\Box^2$ norm of a kernel $K$ (which can now be interpreted as a linear operator $T_K$ from
$L^2(A_1)$ to $L^2(A_2)$) controls the $L^2$ operator norm of $T_K$. Indeed, one has the identity

\[ \|K\|_{\Box^2} = \|T_K^* T_K\|_{HS(L^2(A_1) \to L^2(A_1))}^{1/2} = \|T_K T_K^*\|_{HS(L^2(A_2) \to L^2(A_2))}^{1/2} = \mathrm{tr}_{A_1}(T_K^* T_K T_K^* T_K)^{1/4} = \mathrm{tr}_{A_2}(T_K T_K^* T_K T_K^*)^{1/4} \tag{3.4} \]

where $HS$ is the normalized Hilbert-Schmidt norm, and $\mathrm{tr}_A$ is the normalized trace on
$A$; equivalently, $\|K\|_{\Box^2}$ is the $l^4$ norm of the (normalized) singular values of $K$, while the
operator norm is the $l^\infty$ norm of these singular values (and the Hilbert-Schmidt norm is
the $l^2$ norm). Thus one can view the $\Box^d$ norm as a multilinear generalization of the $l^4$
Schatten-von Neumann norm. This norm has also arisen in the study of pseudorandom 
sets and graphs, see for instance 

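Now we specialize to the problem of counting arithmetic progressions in $\mathbf{Z}/N\mathbf{Z}$.

Definition 3.3 (Gowers uniformity norm). Let $f : \mathbf{Z}/N\mathbf{Z} \to \mathbf{C}$ be a function and $d \ge 1$.
Then we define the Gowers uniformity norm $\|f\|_{U^d}$ to be the quantity $\|f\|_{U^d} := \|K\|_{\Box^d}$,
where $K : (\mathbf{Z}/N\mathbf{Z})^d \to \mathbf{C}$ is the kernel

\[ K(x_1, \dots, x_d) := f(x_1 + \dots + x_d). \]

Equivalently, we have

\[ \|f\|_{U^d} := \mathbf{E}\Bigl(\prod_{\varepsilon \in \{0,1\}^d} \mathcal{C}^{|\varepsilon|} f\Bigl(x + \sum_{i=1}^d \varepsilon_i h_i\Bigr) \mid x, h_1, \dots, h_d \in \mathbf{Z}/N\mathbf{Z}\Bigr)^{1/2^d}, \]

or alternatively we have the recursive definitions

\[ \|f\|_{U^1} := |\mathbf{E}(f)|; \qquad \|f\|_{U^{d+1}} := \mathbf{E}\bigl(\|f(x + h)\overline{f(x)}\|_{U^d}^{2^d} \mid h \in \mathbf{Z}/N\mathbf{Z}\bigr)^{1/2^{d+1}}. \tag{3.5} \]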



Since $\Box^d$ was a norm for $d \ge 2$, we see that $U^d$ is also a norm when $d \ge 2$. In the
$d = 2$ case, one can easily verify the identity

\[ \|f\|_{U^2} = \|\hat f\|_{l^4}, \]

which can be viewed as a special case of (3.4), observing that the Fourier coefficients
of $f$ are essentially the eigenvalues of $K$. However, for $d \ge 3$ the $U^d$ norm becomes
more complicated, and has no particularly useful representation in terms of the Fourier
transform. Using the Gowers-Cauchy-Schwarz inequality, it is possible to show the
monotonicity relationship $\|f\|_{U^d} \le \|f\|_{U^{d+1}}$ for all $d$; one can also show that $\|f\|_{U^d} \to \|f\|_{L^\infty}$
as $d \to \infty$. We shall neither prove nor use these facts here.
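The identity $\|f\|_{U^2} = \|\hat f\|_{l^4}$ can be confirmed by brute force for a small prime. The Python sketch below is our own illustration; it assumes the normalization $\hat f(\xi) := \mathbf{E}(f(x) e_N(-x\xi) \mid x \in \mathbf{Z}/N\mathbf{Z})$ with the $l^4$ norm taken as a plain sum over $\xi$, and the names `u2_norm` and `fourier_l4` and the choice $N = 7$ are hypothetical.

```python
import cmath

def e_N(x, N):
    """The character e_N(x) = exp(2*pi*i*x/N)."""
    return cmath.exp(2j * cmath.pi * x / N)

def u2_norm(f):
    """||f||_{U^2} from the definition, averaging over x, h1, h2 in Z/NZ."""
    N = len(f)
    total = 0j
    for x in range(N):
        for h1 in range(N):
            for h2 in range(N):
                total += (f[x] * f[(x + h1) % N].conjugate()
                          * f[(x + h2) % N].conjugate() * f[(x + h1 + h2) % N])
    return max(total.real / N ** 3, 0.0) ** 0.25

def fourier_l4(f):
    """l^4 norm of the Fourier coefficients (assumed normalization, see lead-in)."""
    N = len(f)
    coeffs = [sum(f[x] * e_N(-x * xi, N) for x in range(N)) / N for xi in range(N)]
    return sum(abs(c) ** 4 for c in coeffs) ** 0.25

N = 7
quadratic_phase = [e_N(x * x, N) for x in range(N)]
linear_phase = [e_N(3 * x, N) for x in range(N)]
print(u2_norm(quadratic_phase), fourier_l4(quadratic_phase), u2_norm(linear_phase))
```

For the quadratic phase every Fourier coefficient has size $N^{-1/2}$ (a Gauss sum), so both quantities equal $N^{-1/4}$, whereas a linear phase has $U^2$ norm exactly 1.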

We can now obtain an analogue of (2.8).

Lemma 3.4 (Generalized von Neumann theorem). [T7] Let $k \ge 3$, and let $N$ be a prime
larger than $k$. Let $f_0, \dots, f_{k-1}$ be bounded functions on $\mathbf{Z}/N\mathbf{Z}$. Then we have

\[ |\Lambda_k(f_0, \dots, f_{k-1})| \le \min_{0 \le j \le k-1} \|f_j\|_{U^{k-1}}. \]

Proof. Fix $0 \le j \le k - 1$; it thus suffices to show that

\[ |\Lambda_k(f_0, \dots, f_{k-1})| \le \|f_j\|_{U^{k-1}}. \]

Observe that for any $x_1, \dots, x_{k-1} \in \mathbf{Z}/N\mathbf{Z}$, the sequence

\[ \Bigl(x_1 + \dots + x_{k-1} - (j - i) \sum_{i' \ne j} \frac{x_{i'}}{j - i'}\Bigr)_{0 \le i \le k-1} \]

is an arithmetic progression of length $k$ in $\mathbf{Z}/N\mathbf{Z}$ (here we are using the hypothesis that
$N$ is prime and larger than $k$ in order to invert $j - i'$; the $k - 1$ variables are understood
to be indexed by $i' \in \{0, \dots, k-1\} \setminus \{j\}$, which is literally $x_1, \dots, x_{k-1}$ when $j = 0$). Conversely, each progression
$x, x + r, \dots, x + (k-1)r$ can be expressed in the above form in exactly the same number
of ways ($N^{k-3}$, to be exact). We may thus write

\[ \Lambda_k(f_0, \dots, f_{k-1}) = \mathbf{E}\Bigl(\prod_{i=0}^{k-1} f_i\Bigl(x_1 + \dots + x_{k-1} - (j - i) \sum_{i' \ne j} \frac{x_{i'}}{j - i'}\Bigr) \mid x_1, \dots, x_{k-1} \in \mathbf{Z}/N\mathbf{Z}\Bigr). \]

Now observe that the $i$-th factor in the above product is bounded and does not depend on $x_i$
when $i \ne j$, and that the $j$-th factor is $f_j(x_1 + \dots + x_{k-1})$. Applying the van der Corput
lemma and the definition of the $U^{k-1}$ norm, we obtain the claim. □
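The change of variables in this proof can be verified exhaustively for small parameters. In the Python sketch below (an illustration under our own conventions, not the author's code) the $k - 1$ free variables are stored in a dictionary indexed by $i' \in \{0, \dots, k-1\} \setminus \{j\}$; the function name `progression` and the choice $N = 7$, $k = 4$, $j = 2$ are hypothetical. It checks that the sequence is an arithmetic progression, that the $j$-th term is the sum of the variables, that the $i$-th term does not depend on $x_i$, and that every pair (first term, spacing) arises exactly $N^{k-3}$ times.

```python
from collections import Counter
from itertools import product

def progression(xs, j, N):
    """Terms S - (j - i) * r for i = 0..k-1, where S = sum of xs and r = sum x_{i'}/(j - i')."""
    k = len(xs) + 1
    S = sum(xs.values()) % N
    # pow(a, -1, N) is the inverse of a modulo the prime N (Python 3.8+)
    r = sum(x * pow((j - i2) % N, -1, N) for i2, x in xs.items()) % N
    return [(S - (j - i) * r) % N for i in range(k)]

N, k, j = 7, 4, 2                       # small illustrative parameters (assumption)
others = [i for i in range(k) if i != j]
fibres = Counter()
ok = True
for values in product(range(N), repeat=k - 1):
    xs = dict(zip(others, values))
    terms = progression(xs, j, N)
    r = (terms[1] - terms[0]) % N
    # consecutive differences are all equal to r
    ok = ok and all((terms[i + 1] - terms[i]) % N == r for i in range(k - 1))
    # the j-th term is the sum of all the variables
    ok = ok and terms[j] == sum(values) % N
    # the i-th term is unchanged when x_i alone is perturbed
    for i in others:
        changed = dict(xs)
        changed[i] = (xs[i] + 1) % N
        ok = ok and progression(changed, j, N)[i] == terms[i]
    fibres[(terms[0], r)] += 1
print(ok, len(fibres), set(fibres.values()))
```

The uniform fibre size is what justifies replacing the average over $(x, r)$ in $\Lambda_k$ by an average over the $k - 1$ new variables.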
Let us informally call a bounded function $f$ Gowers uniform of order $k - 2$ if $\|f\|_{U^{k-1}}$
is small; thus for instance a function with small $U^2$ norm is linearly uniform, a function
with small $U^3$ norm is quadratically uniform, and so forth. The above lemma then
asserts that functions which are Gowers uniform of order $k - 2$ have a negligible impact
on the $\Lambda_k$ multilinear form.

Example 3.5. Let $N$ be a prime number, let $P : \mathbf{Z}/N\mathbf{Z} \to \mathbf{Z}/N\mathbf{Z}$ be a polynomial of
degree $d$ in the field $\mathbf{Z}/N\mathbf{Z}$, and let $f(x) := e_N(P(x))$; thus $f$ is a bounded function.
One can easily verify that $\|f\|_{U^{k-1}} = 1$ when $d \le k - 2$ (basically because the $(k-1)$-th
derivative of $P$ vanishes), so that $f$ is not uniform of any order $d$ or greater. (In
fact, one has the more general statement that $\|fg\|_{U^{k-1}} = \|g\|_{U^{k-1}}$ for arbitrary $g$
whenever $d \le k - 2$; thus the $U^{k-1}$ norm is invariant under polynomial phase modulations
of degree $k - 2$ or less). On the other hand, one can verify that $\|f\|_{U^{k-1}} = O_d(N^{-1/2^d})$
when $d > k - 2$; this is easiest to accomplish when $d = k - 1$, and the remaining cases
follow by monotonicity (or van der Corput type arguments for Weyl sums). Thus $f$ is
uniform of order $d - 1$ or less. The intuition to have here is that a bounded function is
(heuristically) uniform of order $d$ iff its phase is "orthogonal" to all polynomial phases of
degree $d$ or less. In the $d = 1$ case this intuition is precise: linear uniformity corresponds
to being orthogonal to linear phase functions, as the estimates (2.11) already attest to.
When $d \ge 2$ however this intuition is harder to pin down, and the theory is still not
completely understood.

Now consider a quadratic polynomial $P(x)$, with corresponding quadratic phase func-
tion $f(x) := e_N(P(x))$. From the identity

\[ P(x) - 3P(x + r) + 3P(x + 2r) - P(x + 3r) = 0 \]

(which reflects the fact that the third derivative of $P$ vanishes), we observe that

\[ \Lambda_4(f, \overline{f}^3, f^3, \overline{f}) = 1. \]

Thus $f$ is non-negligible for the purposes of computing the $\Lambda_4$ form. This is despite $f$
being linearly uniform (all the Fourier coefficients of $f$ are $O(N^{-1/2})$, as one sees from
the classical theory of Gauss sums). This shows that for the purposes of analyzing $\Lambda_4$, it
is really quadratic uniformity which is the concept to be studied, not linear uniformity.
Similarly, the concept of being Gowers uniform of order $k - 2$ is the one which is related
to the form $\Lambda_k$, which in turn counts arithmetic progressions of length $k$.
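The quadratic phase example is small enough to check directly. The Python sketch below is an illustration only: the prime $N = 11$, the particular polynomial $3x^2 + 2x + 5$ and the variable names are hypothetical choices, and the Fourier coefficient is normalized as $\mathbf{E}(f(x) e_N(-x\xi) \mid x)$, which is our assumption about the convention. It verifies the third-difference identity, evaluates the four-term form with the pattern of conjugates and cubes dictated by the coefficients $1, -3, 3, -1$, and lists the sizes of the Fourier coefficients.

```python
import cmath

N = 11                                  # a small odd prime (assumption)

def P(x):
    """A sample quadratic polynomial over Z/NZ."""
    return (3 * x * x + 2 * x + 5) % N

def f(x):
    """The quadratic phase e_N(P(x))."""
    return cmath.exp(2j * cmath.pi * P(x) / N)

# Third-difference identity: P(x) - 3P(x+r) + 3P(x+2r) - P(x+3r) = 0 in Z/NZ.
identity_holds = all(
    (P(x) - 3 * P(x + r) + 3 * P(x + 2 * r) - P(x + 3 * r)) % N == 0
    for x in range(N) for r in range(N)
)

# The 4-term form evaluated at (f, conj(f)^3, f^3, conj(f)).
lambda_4 = sum(
    f(x) * f(x + r).conjugate() ** 3 * f(x + 2 * r) ** 3 * f(x + 3 * r).conjugate()
    for x in range(N) for r in range(N)
) / N ** 2

# Sizes of the Fourier coefficients of f.
fourier_sizes = [
    abs(sum(f(x) * cmath.exp(-2j * cmath.pi * x * xi / N) for x in range(N)) / N)
    for xi in range(N)
]
print(identity_holds, lambda_4, max(fourier_sizes))
```

So the form is as large as it can be even though every Fourier coefficient has size $N^{-1/2}$, which is the phenomenon the example is meant to illustrate.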

4. Progressions of length 4 

With the above machinery, we can now sketch two different proofs of Szemeredi's
theorem for progressions of length 4. (These arguments also extend, with some addi-
tional difficulties, to higher $k$, but we will not discuss these technicalities here). The
first proof we present is due to Gowers [THj and can be viewed as a generalization of
Roth's Fourier-analytic argument, being a density-incrementation argument using qua-
dratic Fourier analysis instead of linear Fourier analysis. The second proof is adapted
from that in jH], which in turn is based on the original ergodic theory arguments of
Furstenberg and co-authors; it is a generalization of the second proof of Roth's
theorem given earlier; in particular, it is an energy-incrementation argument based
on the decomposition of an arbitrary function into an "almost periodic function of order
2" and a quadratically uniform function.

We begin by discussing Gowers' proof, though we shall omit many of the details which
pertain to arithmetic combinatorics. Once again, we have a subset $A$ of $[1, N]$, which we
embed into a cyclic group $\mathbf{Z}/p\mathbf{Z}$ of prime order. We split $f = 1_A = g + b$, where $g = \mathbf{E}(f)$
and $b = f - \mathbf{E}(f)$. If $b$ is quadratically uniform in the sense that $\|b\|_{U^3}$ is suitably small
(less than $c\delta^C$ for some absolute constants $c, C > 0$) then, by using Lemma 3.4 to
develop an analogue of Proposition 2.3, one can easily obtain non-trivial lower
bounds for $\Lambda_4(f, f, f, f)$ and thus establish plenty of arithmetic progressions of length
4 in $A$.

The difficulty comes in the "hard case", when $b$ is not quadratically uniform, so
that $\|b\|_{U^3}$ is relatively large. The difficulty here is that unlike the $U^2$ norm, which is
the $l^4$ norm of the Fourier transform, the $U^3$ norm is not easily related to the Fourier
transform; for instance in Example 3.5 we saw that there were functions which had very
small Fourier transform but had large $U^3$ norm. Nevertheless, it is still possible to use
this information to deduce some structural information about $A$. The situation can be
clarified somewhat by considering a model problem, which is to determine all functions
of the form $b = e_p(\phi(x))$ which have the maximal $U^3$ norm of 1, where $\phi : \mathbf{Z}/p\mathbf{Z} \to \mathbf{Z}/p\mathbf{Z}$
is a phase function. Expanding out the $U^3$ norm, we see that this is equivalent to asking
that

\[ \phi(x+r+s+t) - \phi(x+r+s) - \phi(x+r+t) - \phi(x+s+t) + \phi(x+r) + \phi(x+s) + \phi(x+t) - \phi(x) = 0 \tag{4.1} \]

for all $x, r, s, t \in \mathbf{Z}/p\mathbf{Z}$. This is an "arithmetic" way of asserting that the third derivative
of $\phi$ vanishes. It in fact implies that $\phi$ is a quadratic polynomial, $\phi(x) = ax^2 + bx + c$
(whereas in contrast, the assertion that $b$ would have a maximal Fourier coefficient of
1 is equivalent to asserting that $\phi$ is a linear polynomial). To see this, let us adopt
the notation that for any function $f : \mathbf{Z}/p\mathbf{Z} \to \mathbf{R}/\mathbf{Z}$ and any shift $h \in \mathbf{Z}/p\mathbf{Z}$,
$f_h : \mathbf{Z}/p\mathbf{Z} \to \mathbf{R}/\mathbf{Z}$ denotes the "derivative" $f_h(x) := f(x + h) - f(x)$. Then we have

\[ \phi_h(x + s + t) - \phi_h(x + s) - \phi_h(x + t) + \phi_h(x) = 0 \text{ for all } x, s, t \in \mathbf{Z}/p\mathbf{Z}. \tag{4.2} \]

It is easy to see that this implies that $\phi_h$ is linear, i.e. we have

\[ \phi_h(x) = a(h)x + c(h) \text{ for all } x, h \in \mathbf{Z}/p\mathbf{Z} \tag{4.3} \]

for some $a(h), c(h) \in \mathbf{Z}/p\mathbf{Z}$. (This is easiest seen by first subtracting $\phi_h(0)$ from $\phi_h$,
at which point $\phi_h$ becomes additive). To conclude from this that $\phi$ is quadratic, one
would need to firstly show that $a(h)$ and $c(h)$ have some linearity properties in $h$, and
then "integrate" the equation (4.3) to obtain a quadratic expression for $\phi$.

To attain these goals, we rewrite (4.3) as the functional equation

\[ a(h)x = \phi(x + h) - \phi(x) - c(h) \text{ for all } x, h \in \mathbf{Z}/p\mathbf{Z}. \tag{4.4} \]

We can isolate $a(h)$ in this equation by taking suitable "derivatives". For instance, if
one replaces $x$ by $x + s$ in the above formula to obtain

\[ a(h)(x + s) = \phi(x + s + h) - \phi(x + s) - c(h) \text{ for all } x, h, s \in \mathbf{Z}/p\mathbf{Z} \]

and then subtracts the two equations, one obtains

\[ a(h)s = \phi_s(x + h) - \phi_s(x) \text{ for all } x, h, s \in \mathbf{Z}/p\mathbf{Z} \tag{4.5} \]

thus eliminating the unknown function $c(h)$. Similarly, by replacing $h$ by $h + t$ and then
subtracting, we can eliminate the $\phi_s(x)$ term to obtain

\[ a_t(h)s = \phi_{st}(x + h) \text{ for all } x, h, s, t \in \mathbf{Z}/p\mathbf{Z}. \]

Finally, by replacing $x, h$ by $x - u, h + u$ and subtracting again to eliminate the $\phi_{st}(x + h)$
term, one obtains

\[ a_{tu}(h)s = 0 \text{ for all } h, s, t, u \in \mathbf{Z}/p\mathbf{Z} \tag{4.6} \]

and thus $a$ obeys the functional equation

\[ a(h + t + u) - a(h + t) - a(h + u) + a(h) = 0 \text{ for all } h, u, t \in \mathbf{Z}/p\mathbf{Z} \tag{4.7} \]

which as observed earlier implies that $a(h)$ is linear, thus

\[ a(h) = \alpha h + \beta \text{ for some } \alpha, \beta \in \mathbf{Z}/p\mathbf{Z}. \tag{4.8} \]

(One can in fact force $\beta$ to equal zero, basically because $a(0) = 0$, but we will not do so
here). Now the function $\alpha h x$ can be explicitly integrated (modulo a lower order term)
using the quadratic primitive

\[ F(x) := \frac{\alpha}{2} x^2, \tag{4.9} \]

in the sense that $F_h(x) = \alpha h x + \frac{\alpha}{2} h^2$. Thus if we define $\phi'(x) := \phi(x) - F(x)$ and
$\phi''(x) := \phi(x) - F(x) + \beta x$, then by (4.3), $\phi'$ and $\phi''$ obey the functional equation

\[ \phi'(x + h) - \phi''(x) = c(h) - \frac{\alpha}{2} h^2 \text{ for all } x, h \in \mathbf{Z}/p\mathbf{Z}. \tag{4.10} \]

Replacing $x$ by $x + k$ and subtracting, we obtain that

\[ \phi'(x + h + k) - \phi'(x + h) - \phi''(x + k) + \phi''(x) = 0 \text{ for all } x, h, k \in \mathbf{Z}/p\mathbf{Z} \]

which then implies that $\phi'$ and $\phi''$ are linear. Since $\phi = \phi' + F$, we thus see that $\phi$ is
quadratic as claimed.
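For a very small prime the model problem can be settled by exhaustive search, which gives an independent check of the conclusion. The Python sketch below is our own illustration (the choice $p = 5$ and the function name `third_derivative_vanishes` are hypothetical): it enumerates all $5^5$ functions $\phi : \mathbf{Z}/5\mathbf{Z} \to \mathbf{Z}/5\mathbf{Z}$, keeps those obeying (4.1), and compares the result with the set of quadratic polynomials.

```python
from itertools import product

p = 5                                   # small odd prime (assumption)

def third_derivative_vanishes(phi):
    """Check (4.1); phi is a tuple of length p listing phi(0), ..., phi(p-1)."""
    # Shifts equal to 0 make (4.1) trivially true, so only non-zero r, s, t are tested.
    for x, r, s, t in product(range(p), range(1, p), range(1, p), range(1, p)):
        total = (phi[(x + r + s + t) % p] - phi[(x + r + s) % p]
                 - phi[(x + r + t) % p] - phi[(x + s + t) % p]
                 + phi[(x + r) % p] + phi[(x + s) % p]
                 + phi[(x + t) % p] - phi[x])
        if total % p:
            return False
    return True

solutions = {phi for phi in product(range(p), repeat=p)
             if third_derivative_vanishes(phi)}
quadratics = {tuple((a * x * x + b * x + c) % p for x in range(p))
              for a, b, c in product(range(p), repeat=3)}
print(len(solutions), solutions == quadratics)
```

The search is feasible only because $p$ is tiny; the functional-equation argument in the text is what handles general $p$.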

This concludes the treatment of the model problem. Thanks to the work of Gowers
[12], it turns out that the general strategy used to solve this model problem can also be
used to handle the general case. Indeed, if a function $b$ has large $U^3$ norm (where by
"large" we mean "larger than $C^{-1}\delta^C$ for some absolute constant $C > 0$"), then by (3.5)
the function $b(x + h)\overline{b(x)}$ will have large $U^2$ norm for a large percentage of $h \in \mathbf{Z}/p\mathbf{Z}$
(this is the analogue of (4.2)). Since large $U^2$ norms imply large Fourier coefficients, we thus
see that for all $h$ in a large fraction $H \subset \mathbf{Z}/p\mathbf{Z}$ of $\mathbf{Z}/p\mathbf{Z}$ we can find $a(h), c(h) \in \mathbf{Z}/p\mathbf{Z}$
such that

\[ |\mathbf{E}(b(x + h)\overline{b(x)} e_p(-(a(h)x + c(h))) \mid x \in \mathbf{Z}/p\mathbf{Z})| \ge C^{-1}\delta^C \tag{4.11} \]

and hence

\[ |\mathbf{E}(b(x + h)\overline{b(x)} e_p(-(a(h)x + c(h))) 1_H(h) \mid x, h \in \mathbf{Z}/p\mathbf{Z})| \ge C^{-1}\delta^C. \]

As with the model problem, the task would now be to obtain some linearity control
on $a$. This can be obtained by a Cauchy-Schwarz argument; there are a number of
permutations of this argument, but we shall give one which is based on the van der
Corput lemma, Lemma 3.1. Let us first change variables $x = y_1 - y_2$, $h = y_2 - y_3$ to
obtain

\[ |\mathbf{E}(b(y_1 - y_3)\overline{b(y_1 - y_2)} 1_H(y_2 - y_3) e_p(-c(y_2 - y_3)) K(y_1, y_2, y_3) \mid y_1, y_2, y_3 \in \mathbf{Z}/p\mathbf{Z})| \ge C^{-1}\delta^C, \]

where

\[ K(y_1, y_2, y_3) := e_p(-a(y_2 - y_3)(y_1 - y_2)) 1_H(y_2 - y_3). \]

If we then apply Lemma 3.1 we conclude that

\[ \|K\|_{\Box^3} \ge C^{-1}\delta^C. \]

Raising this to the eighth power and expanding out the left-hand side, one eventually
obtains (after some change of variables)

\[ \mathbf{E}(e_p(-(a(h+t+u) - a(h+t) - a(h+u) + a(h))s) 1_{h, h+u, h+t, h+t+u \in H} \mid h, t, s, u \in \mathbf{Z}/p\mathbf{Z}) \ge C^{-1}\delta^C \]

(this is the analogue of (4.6)). The average in $s$ can be computed explicitly, and we
then obtain

\[ \mathbf{P}(h, h+t, h+u, h+t+u \in H; \; a(h+t+u) - a(h+t) - a(h+u) + a(h) = 0 \mid h, t, u \in \mathbf{Z}/p\mathbf{Z}) \ge C^{-1}\delta^C \tag{4.12} \]

(cf. (4.7)). This is now a purely arithmetic-combinatorial statement about $a$, involving
no oscillation; it says that $a$ behaves like an (affine-) linear function "a significant fraction
of the time". In analogy with (4.8), it is then tempting to conjecture from this that $a(h)$
should in fact equal an affine linear function $\alpha h + \beta$ for a significant fraction of the time,
i.e. we should be able to find $\alpha, \beta \in \mathbf{Z}/p\mathbf{Z}$ such that

\[ \mathbf{P}(h \in H; \; a(h) = \alpha h + \beta \mid h \in \mathbf{Z}/p\mathbf{Z}) \ge C^{-1}\delta^C \tag{4.13} \]

(note that in the converse direction, one can use (4.13) and a Cauchy-Schwarz
argument to obtain (4.12)). Suppose for the moment that one could indeed deduce
(4.13) from (4.12). Then we can introduce the primitive function (4.9) as before, and
define $b'(x) := b(x)e_p(-F(x))$ and $b''(x) := b'(x)e_p(\beta x)$; we then see from (4.11) that

\[ |\mathbf{E}(b'(x + h)\overline{b''(x)} e_p(\tfrac{\alpha}{2}h^2 - c(h)) \mid x \in \mathbf{Z}/p\mathbf{Z})| \ge C^{-1}\delta^C \]

for all $h \in H$ with $a(h) = \alpha h + \beta$ (cf. (4.10)). In particular we see that

\[ |\mathbf{E}(b'(x + h)\overline{b''(x)} \mid x \in \mathbf{Z}/p\mathbf{Z})| \ge C^{-1}\delta^C \]

for these $h$. Taking $L^2$ norms of both sides in $h$ and using Plancherel, we obtain

\[ \|\hat{b'} \, \hat{b''}\|_{l^2} \ge C^{-1}\delta^C, \]

and thus by Holder's inequality

\[ \|b'\|_{U^2} \ge C^{-1}\delta^C. \]

To summarize, we started with a function $b$ with large $U^3$ norm, and then were able
to locate a quadratic modulation $b'$ of $b$ which in fact had large $U^2$ norm. Since we
already know that a large $U^2$ norm would imply a large Fourier coefficient, we could
thus deduce the existence of a $\xi \in \mathbf{Z}/p\mathbf{Z}$ such that $\hat{b'}(\xi)$ is large, which would then
imply that the original function $b$ had large correlation with a quadratic phase function
$\chi(x) := e_p(P(x))$ for some quadratic polynomial $P : \mathbf{Z}/p\mathbf{Z} \to \mathbf{Z}/p\mathbf{Z}$, thus $|\langle b, \chi \rangle| \ge C^{-1}\delta^C$.
One can now proceed as in the density increment proof of Roth's theorem,
but with the Bohr sets in $\mathcal{B}_{\varepsilon,\chi}$ now being replaced by "quadratic Bohr sets". This
eventually gives us a density increment of the form (2.13) on a quadratic Bohr set
$\chi^{-1}(Q)$; one can then use Weyl's theorem on equidistribution of quadratic polynomials
mod $p$ to locate a reasonably long arithmetic progression (of length at least $cN^c$ for
some absolute constant $c > 0$, if $N$ is sufficiently large depending on $\delta$) on which one
has a density increment, at which point we may repeat Roth's argument. We omit the
details, referring the reader instead to [TB].

We return briefly now to a step glossed over in the above sketch, namely the deduction
of (4.13) from (4.12). As it turns out, this implication is false as stated; it is possible
for $a$ to be additive in the sense of (4.12) without being approximately linear in the
sense of (4.13), because $a$ may instead be behaving like a "higher-dimensional" linear
function. An example of this is as follows. Let $M$ be an integer between $\sqrt{p}/4$ and
$\sqrt{p}/2$, let $H := \{n + 2Mm : 1 \leq n, m \leq M\}$, and let $a : H \to \mathbb{Z}/p\mathbb{Z}$ be the function
$a(n + 2Mm) = \alpha n + \beta m$ for some fixed $\alpha, \beta \in \mathbb{Z}/p\mathbb{Z}$. Then one can easily verify that $a$
obeys the property (4.12) but not (4.13) (if $\beta \neq 2M\alpha$). The set $H$ is an example of a
two-dimensional arithmetic progression, and the function $a$ given here is a generalized linear
function on this progression. More generally one can define the notion of a generalized
arithmetic progression (of arbitrary dimension), and of a generalized linear function
on this progression. It is then possible to obtain a deduction of the form (4.12) $\Rightarrow$
(4.13), but with the role of $\alpha h + \beta$ being played by these generalized linear functions.
Also, for technical reasons (having to do with relatively poor constants in a certain
inverse theorem from additive combinatorics known as Freiman's theorem) one must
replace the lower bound of $C^{-1}\delta^{C}$ by a smaller quantity such as $\exp(-C\delta^{-C})$; it is not
known whether this exponential loss can be removed. The deduction here requires a
combination of techniques from combinatorial graph theory, probabilistic combinatorics,
Fourier analysis, and the geometry of lattices and Bohr sets; it is somewhat involved
and we will not go into the details here, referring the reader instead to [TQ].
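The two-dimensional example above can be checked by brute force. The following is a minimal sketch, not taken from the cited papers. It assumes that (4.12) is read as the exact additivity condition $a(h_1) + a(h_2) = a(h_3) + a(h_4)$ whenever $h_1 + h_2 = h_3 + h_4$, and that (4.13) is read as agreement with a single function $xh + y$. The values $p = 101$, $M = 4$, $\alpha = 1$, $\beta = 3$ and all function names are illustrative choices; $M$ is taken small enough that sums of two elements of $H$ never wrap around mod $p$.

```python
from itertools import product


def generalized_linear_map(p=101, M=4, alpha=1, beta=3):
    """Return a dict h -> a(h) on H = {n + 2*M*m : 1 <= n, m <= M}."""
    # Illustrative assumption: sums of two elements of H stay below p,
    # so additive relations mod p are the same as relations over the integers.
    assert 4 * M * M + 2 * M < p
    a = {}
    for n, m in product(range(1, M + 1), repeat=2):
        a[n + 2 * M * m] = (alpha * n + beta * m) % p
    return a


def is_additive(a, p):
    """Check a(h1) + a(h2) depends only on h1 + h2 (mod p)."""
    seen = {}
    for h1, h2 in product(a, repeat=2):
        value = (a[h1] + a[h2]) % p
        if seen.setdefault(h1 + h2, value) != value:
            return False
    return True


def best_linear_agreement(a, p):
    """Largest number of h in H with a(h) == x*h + y (mod p), over all x, y."""
    best = 0
    for x in range(p):
        counts = {}
        for h, value in a.items():
            y = (value - x * h) % p
            counts[y] = counts.get(y, 0) + 1
        best = max(best, max(counts.values()))
    return best


p = 101
a = generalized_linear_map(p=p, M=4, alpha=1, beta=3)  # beta != 2*M*alpha
print("additive:", is_additive(a, p))
print("best agreement with a linear function:", best_linear_agreement(a, p), "of", len(a))
```

In this toy instance the map is additive on all of $H$, yet no single linear function matches it on all of $H$, which is the obstruction described above.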

The remainder of Gowers' argument in ^H] is concerned with how to use the fact
that $a$ is approximately equal to a higher-dimensional linear function to again deduce a
density increment of $A$ on some sub-progression. This is again done mainly by Weyl's
theory of uniform distribution; however in [22] an alternate argument was developed,
which is based on locating a primitive $F$ to $a$. This argument closely mimics the one
given in the one-dimensional case when $a(h) \approx \alpha h + \beta$; however, there is an additional
difficulty in the higher-dimensional case, namely that not every linear function has a
primitive; instead, only the "self-adjoint" linear functions do. This has to do with the
fact that quadratic forms in higher dimensions (the analogue of quadratic polynomials
in one dimension) are associated to symmetric matrices rather than general matrices.
Fortunately, one can show that the function $a$ does indeed obey the required symmetry
property. Rather than give the precise statement and proof of this assertion in detail,
we sketch how it works in a model case. Here we consider solutions to the equation
(4.4), but now $x, h$ take values in a vector space $V := (\mathbb{Z}/p\mathbb{Z})^n$, and $a(h)$ is now a linear
transformation from $V$ to $\mathbb{Z}/p\mathbb{Z}$. By arguing as before, we conclude that $a(h) = \alpha h + \beta$,
where $\beta$ is now a linear transformation from $V$ to $\mathbb{Z}/p\mathbb{Z}$, and $\alpha$ is a bilinear form from
$V \times V$ to $\mathbb{Z}/p\mathbb{Z}$. Inserting this back into (4.5), we obtain

$$\alpha(h, s) + \beta s = \phi(x + h + s) - \phi(x + h) - \phi(x + s) + \phi(x) \text{ for all } x, h, s \in V.$$

Now we proceed a little differently to before. If we replace $h, s$ by $h + u, s - u$ and
subtract, we obtain

$$\alpha(h + u, s - u) - \alpha(h, s) - \beta u = -\phi_u(x + h) - \phi_{-u}(x + s) \text{ for all } x, h, s, u \in V.$$

If now we replace $x, h, s$ by $x + t, h - t, s - t$ and subtract, we obtain

$$\alpha(h + u - t, s - u - t) - \alpha(h - t, s - t) - \alpha(h + u, s - u) + \alpha(h, s) = 0 \text{ for all } h, s, u, t \in V.$$

Using the bilinearity of $\alpha$, this simplifies to

$$\alpha(t, u) - \alpha(u, t) = 0 \text{ for all } u, t \in V$$

which shows that $\alpha$ is symmetric. In particular this allows us to construct a primitive
$F$ by the formula $F(x) := \frac{1}{2}\alpha(x, x)$, and the previous argument now proceeds
as before. Back in the original setting of a function $b$ with large $U^3$ norm, an analogous
argument allows us to locate a "generalized quadratic polynomial phase function"
$\chi(x) := e_p(P(x))$ such that $\langle b, \chi \rangle$ is somewhat large; see [22] for a rigorous statement
and proof of this "inverse theorem for the $U^3$ norm". (Interestingly, there are some
closely related results arising from ergodic theory; see |2H|, [4"T]).
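For the reader's convenience, here is a sketch of the expansion behind the last simplification in the model case, writing $\alpha$ for the bilinear form. Only bilinearity is used; this is our own bookkeeping rather than a quotation from [22].

```latex
% Expand the two shifted terms using bilinearity of \alpha:
\begin{align*}
\alpha(h+u-t,\, s-u-t) &= \alpha(h+u,\, s-u) - \alpha(h+u,\, t) - \alpha(t,\, s-u) + \alpha(t,t), \\
\alpha(h-t,\, s-t)     &= \alpha(h,s) - \alpha(h,t) - \alpha(t,s) + \alpha(t,t).
\end{align*}
% Substitute into the four-term identity; the \alpha(h+u,s-u), \alpha(h,s)
% and \alpha(t,t) terms cancel:
\begin{align*}
0 &= \alpha(h+u-t,\, s-u-t) - \alpha(h-t,\, s-t) - \alpha(h+u,\, s-u) + \alpha(h,s) \\
  &= -\alpha(h+u,\, t) - \alpha(t,\, s-u) + \alpha(h,t) + \alpha(t,s) \\
  &= -\alpha(u,t) + \alpha(t,u).
\end{align*}
```

The variables $h$ and $s$ drop out entirely, leaving exactly the symmetry relation $\alpha(t,u) = \alpha(u,t)$.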

This concludes our discussion of Gowers' proof of Szemeredi's theorem for progressions
of length 4. The argument also extends to higher $k$, though with some non-trivial
additional difficulties; also, it is not at present clear whether the higher $U^d$ norms also
enjoy an inverse theorem. We now briefly discuss another proof of this theorem, which
extends the energy increment proof for progressions of length three discussed earlier.
There are many proofs in this spirit, starting with the work of Furstenberg JU], ^T]
(and a related energy-incrementation argument also appears in JHH]); we shall loosely
follow the version of this argument from j^T]. For the sake of simplicity we shall confine our
discussion to the $k = 4$ case only.

As it turns out, large portions of the energy increment proof generalize without dif- 
ficulty to obtain progressions of arbitrary length. The main difficulty is to replace the 
concept of a $(\delta, K)$-almost periodic function with a "higher order" generalization. The
definition given in Definition 12 . til relies too heavily on linear phase functions, and we have 
already seen some difficulties in extending that concept to higher orders; for instance,
we still do not have a satisfactory theory of what a "quadratically quasiperiodic function"
should be, although there are some very promising developments in the ergodic
theory of nilfactors (see e.g. jH], |1E]) which should shed light on this question
very soon. However, it is well understood by now how to generalize the more general
concept of an almost periodic function. In ergodic theory, a function $f$ in a measure-preserving
system $(X, \mathcal{B}, \mu, T)$ is said to be almost periodic if the orbit $\{T^n f : n \in \mathbb{Z}\}$ is
precompact, and in particular can be approximated to arbitrary accuracy by a subset
of a finite-dimensional space. In the discrete setting of $\mathbb{Z}/N\mathbb{Z}$, every function is periodic
of order $N$, and thus, technically speaking, every function is almost periodic. However
one can still extract a useful concept of almost periodicity by making the concept of
"precompact" more quantitative. One such way of doing so is:

Definition 4.1 (Uniform almost periodicity norms). [41] If $A$ is a shift-invariant Banach
algebra of functions on $\mathbb{Z}_N$, we define the space $UAP[A]$ to be the space of all functions
$F$ for which the orbit $\{T^n F : n \in \mathbb{Z}\}$ has a representation of the form

$$T^n F = M \mathbf{E}(c_{n,h} g_h) \text{ for all } n \in \mathbb{Z}_N \tag{4.14}$$

where $M \geq 0$, $H$ is a finite non-empty set, $g = (g_h)_{h \in H}$ is a collection of bounded
functions, $c = (c_{n,h})_{n \in \mathbb{Z}_N, h \in H}$ is a collection of functions in $A$ with $\|c_{n,h}\|_A \leq 1$, and $h$ is
a random variable taking values in $H$. We define the norm $\|F\|_{UAP[A]}$ to be the infimum
of $M$ over all possible representations of this form.

The formula (4.14) is a quantitative assertion that the orbit $\{T^n F : n \in \mathbb{Z}\}$ can
be represented efficiently by what is essentially a finite-dimensional approximation, and
is thus an assertion of precompactness "relative to $A$". It can be shown (see [41])
that $UAP[A]$ is a shift-invariant Banach algebra. If we let $A$ be the trivial Banach
algebra of constant functions (so that the $c_{n,h}$ are constants, with $\|c_{n,h}\|_A = |c_{n,h}|$) then
we abbreviate $UAP[A]$ as $UAP^1$, and refer to functions with bounded $UAP^1$ norm as
linearly uniformly almost periodic. For instance, one can show that any $K$-quasiperiodic
function is linearly uniformly almost periodic, with a $UAP^1$ norm of at most $K$. In
particular, linear phase functions are linearly uniformly almost periodic, with a $UAP^1$
norm of exactly 1.

One can then define the space $UAP^2 := UAP[UAP^1]$ of quadratically uniformly almost periodic
functions, which is roughly speaking the space of functions which are almost periodic
relative to the linearly almost periodic functions. For example, consider the function
$f(x) := e_N(x^2)$. This function is very far from being linearly almost periodic - in the
sense that the $UAP^1$ norm is huge - because the translates $T^n f(x) = e_N(x^2 + 2nx + n^2)$
are all quite distinct and cannot efficiently be expressed as linear combinations of a
small number of functions. On the other hand, we may write $T^n f = c_n g$ where $g := f$
and $c_n(x) := e_N(2nx + n^2)$, and note that each $c_n$, being a linear phase function, lies in
$UAP^1$ with small norm. Thus this function is quadratically almost periodic; in fact, it
lies in $UAP^2$ with norm 1. The property of being quadratically almost periodic strictly
generalizes the concept of a quadratic eigenfunction in ergodic theory; see e.g. |lHj
for further discussion.
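The factorization $T^n f = c_n f$ for the quadratic phase $f(x) = e_N(x^2)$ is easy to confirm numerically. The sketch below assumes the convention $T^n f(x) = f(x + n)$ (as in the formula for the translates above) and $e_N(t) = \exp(2\pi i t/N)$; the function names and the choice $N = 31$ are ours.

```python
import cmath


def e_N(t, N):
    """The phase e_N(t) = exp(2*pi*i*t/N), with t reduced mod N first."""
    return cmath.exp(2j * cmath.pi * (t % N) / N)


def shift_factorization_error(N, n):
    """Return max over x of |T^n f(x) - c_n(x) f(x)| for f(x) = e_N(x^2)."""
    worst = 0.0
    for x in range(N):
        shifted = e_N((x + n) ** 2, N)       # T^n f(x) = f(x + n)
        c_n = e_N(2 * n * x + n * n, N)      # a linear phase in x, for fixed n
        worst = max(worst, abs(shifted - c_n * e_N(x * x, N)))
    return worst


N = 31
print(max(shift_factorization_error(N, n) for n in range(N)))
```

Up to floating-point rounding the error is zero for every shift $n$, so each translate of $f$ is a linear phase times $f$ itself.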

The concept of quadratic almost periodicity (bounded $UAP^2$ norm) is in many ways
dual to that of quadratic uniformity (small $U^3$ norm). We present three results supporting
this claim. The first is the duality inequality

$$|\langle f, F \rangle| \leq \|f\|_{U^3} \|F\|_{UAP^2},$$

which can be proven by a simple Cauchy-Schwarz argument, see [41]. Secondly, if $f$ is
such that $\|f\|_{U^3}, \|f\|_{L^\infty} \leq 1$, and we let $\mathcal{D}f$ denote the dual function

$$\mathcal{D}f(x) := \mathbf{E}(f(x+a) f(x+b) f(x+c) f(x+a+b) f(x+a+c) f(x+b+c) f(x+a+b+c) | a, b, c \in \mathbb{Z}_N)$$

then $\mathcal{D}f$ lies in $UAP^2$ with a norm of at most 1; again, see [41]. Furthermore, we have
the correlation identity

$$\langle f, \mathcal{D}f \rangle = \|f\|_{U^3}^8.$$

By using these dual functions to replace the role of linear (or quadratic) phase functions,
one can obtain the following variant of Proposition 2.11:

Proposition 4.2 (Quantitative Koopman-von Neumann theorem). [3T] Let
$F : \mathbb{R}^+ \times \mathbb{R}^+ \times \mathbb{R}^+ \to \mathbb{R}^+$ be an arbitrary function, let $0 < \sigma < \delta \leq 1$, and let $f$ be any bounded
non-negative function on $\mathbb{Z}/N\mathbb{Z}$ obeying (2.1). Then there exists a quantity $0 < K \leq C(F, \delta, \sigma)$
and a decomposition $f = g + b$, where $g$ is bounded, non-negative, has mean
$\mathbf{E}(g) = \mathbf{E}(f)$, and we have the bound

$$\|b\|_{U^3} \leq F(\delta, \sigma, K).$$

Furthermore we have an additional decomposition $g = \tilde{g} + e$ with $\tilde{g}$ non-negative and
the bounds

$$\|\tilde{g}\|_{UAP^2} < K; \quad \|e\|_{L^2} \leq \sigma.$$

The proof of this Proposition proceeds by an energy incrementation argument very
similar to Proposition 2.11: one begins with the trivial splitting $f = \mathbf{E}(f) + (f - \mathbf{E}(f))$,
and whenever the bad function $b$ fails to be quadratically uniform, one uses the dual
function $\mathcal{D}b$ (which is quadratically almost periodic) to refine the $\sigma$-algebra used to
construct the good function $g$, thus increasing the energy of $g$ by a non-trivial amount.

By combining this with the generalized von Neumann theorem in Lemma \3A\ we can
conclude the proof of Szemeredi's theorem in this case once we show the analogue of
Lemma 2.7:

Theorem 4.3 (Almost periodic functions are recurrent). Let $g, \tilde{g}$ be non-negative bounded
functions such that we have the estimates

$$\|g - \tilde{g}\|_{L^2} \leq \frac{\delta^2}{4096} \tag{4.15}$$

$$\mathbf{E}(g | \mathbb{Z}_N) \geq \delta \tag{4.16}$$

$$\|\tilde{g}\|_{UAP^2} < M \tag{4.17}$$

for some $0 < \delta, M < \infty$. Then we have

$$\mathbf{E}(g(x) T^r g(x) T^{2r} g(x) T^{3r} g(x) | x, r \in \mathbb{Z}_N) \geq c(\delta, M) \tag{4.18}$$

for some $c(\delta, M) > 0$.

The proof of this theorem is the most difficult component of the argument; it uses
the uniform almost periodicity control on $\tilde{g}$ to "color" the orbit of $T^n \tilde{g}$ and hence $T^n g$,
and then invokes the van der Waerden theorem [44] to extract arithmetic progressions
from $g$. As such, this part of the argument can be considered to be more combinatorial
than ergodic or analytic in nature.



5. Progressions in the primes 

There are many questions concerning the distribution of the prime numbers (and of
various configurations of prime numbers), which have motivated a large portion of analytic
number theory. One of the basic results in the subject is of course the prime
number theorem, which asserts that the number of primes between 1 and $N$ asymptotically
approaches $N / \log N$ as $N \to \infty$, or in other words

$$\#\{1 \leq n \leq N : n \text{ is prime}\} = \frac{N}{\log N} (1 + o(1)),$$

where we use $o(1)$ to denote a quantity which goes to zero as $N \to \infty$.

It is convenient to normalize the prime number theorem in a different form. Define
the von Mangoldt function $\Lambda : \mathbb{Z}^+ \to \mathbb{R}$ by setting $\Lambda(n) := \log p$ whenever $n = p^j$ is
a power of a prime $p$ for some $j \geq 1$, and $\Lambda(n) = 0$ otherwise; the significance of this
function to number theory lies in the identity

$$\log n = \sum_{d | n} \Lambda(d) \tag{5.1}$$

for all positive integers $n$ (where the sum is over all positive integers $d$ dividing $n$), which is a restatement
of the unique factorization theorem. The von Mangoldt function is essentially supported
on the primes (there are also the squares and higher powers of primes, but they are
extremely sparse, and in practice are completely negligible, contributing only to the
$o(1)$ error terms). Then the prime number theorem is easily seen to be equivalent to

$$\frac{1}{N} \sum_{n=1}^{N} \Lambda(n) = 1 + o(1).$$

The expression on the left-hand side can be viewed as an average or expectation for $\Lambda$;
we shall emphasize this probabilistic (or ergodic) perspective by writing it as
$\mathbf{E}(\Lambda(n) : 1 \leq n \leq N)$; more generally, we write $\mathbf{E}(f(n) | n \in A)$ for $\frac{1}{|A|} \sum_{n \in A} f(n)$ whenever $A$ is
a finite set. Thus $\Lambda$ has an average value of $1 + o(1)$. The error can be improved; for
instance the famous Riemann hypothesis is equivalent to the claim

$$\mathbf{E}(\Lambda(n) | 1 \leq n \leq N) = 1 + O(N^{-1/2} \log^2 N).$$

However the improved error estimates are not central to the results we shall discuss here,
which are in some sense more focused on the main term in such estimates involving the
primes.
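As a sanity check on these definitions, the following sketch computes $\Lambda$ by trial division, verifies the identity (5.1) for small $n$, and evaluates the average $\mathbf{E}(\Lambda(n) : 1 \leq n \leq N)$. The function names and the cutoff $N = 10^4$ are our own illustrative choices; the code makes no attempt at efficiency.

```python
import math


def von_mangoldt(n):
    """Return log p if n = p^j for a prime p and some j >= 1, and 0.0 otherwise."""
    if n < 2:
        return 0.0
    p = 2
    while p * p <= n:
        if n % p == 0:
            # p is the smallest prime factor; n is a prime power iff nothing else remains.
            while n % p == 0:
                n //= p
            return math.log(p) if n == 1 else 0.0
        p += 1
    return math.log(n)  # no factor up to sqrt(n), so n itself is prime


def divisor_sum(n):
    """Right-hand side of (5.1): the sum of Lambda(d) over divisors d of n."""
    return sum(von_mangoldt(d) for d in range(1, n + 1) if n % d == 0)


def mean_von_mangoldt(N):
    """The average E(Lambda(n) : 1 <= n <= N)."""
    return sum(von_mangoldt(n) for n in range(1, N + 1)) / N


print(max(abs(divisor_sum(n) - math.log(n)) for n in range(1, 200)))
print(mean_von_mangoldt(10_000))
```

The first printed number measures the worst discrepancy in (5.1) and should be at the level of rounding error; the second should be close to 1, in line with the prime number theorem.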

Now we consider how to count other patterns inside the primes. One of the oldest (and
still unsolved) problems in the field is the twin prime conjecture, which asks whether
there are an infinite number of primes $p$ such that $p + 2$ is also prime. This would be
implied by the statement

$$\liminf_{N \to \infty} \mathbf{E}(\Lambda(n) \Lambda(n+2) : 1 \leq n \leq N) > 0.$$

In fact Hardy and Littlewood made the stronger
conjecture, the Hardy-Littlewood prime tuple conjecture |2Sj, which would imply the
twin prime conjecture, and would indeed verify the stronger estimate

$$\mathbf{E}(\Lambda(n) \Lambda(n+2) : 1 \leq n \leq N) = B_2 + o(1)$$

where $B_2$ is the twin prime constant

$$B_2 := \prod_p \frac{\mathbf{P}(n, n+2 \text{ coprime to } p | n \in \mathbb{Z}/p\mathbb{Z})}{\mathbf{P}(n \text{ coprime to } p | n \in \mathbb{Z}/p\mathbb{Z}) \, \mathbf{P}(n+2 \text{ coprime to } p | n \in \mathbb{Z}/p\mathbb{Z})} = 2 \prod_{p \geq 3} \frac{p(p-2)}{(p-1)^2} = 1.32032\ldots$$
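The numerical value of the twin prime constant can be reproduced from the Euler product. The sketch below truncates the product at primes up to $10^5$, which is an arbitrary cutoff of ours; since the factors are $1 - 1/(p-1)^2$, the truncation error is of the order of $10^{-6}$. The sieve and function names are illustrative.

```python
def primes_up_to(limit):
    """Return the list of primes <= limit via the sieve of Eratosthenes."""
    sieve = bytearray([1]) * (limit + 1)
    sieve[0:2] = b"\x00\x00"
    for i in range(2, int(limit ** 0.5) + 1):
        if sieve[i]:
            # Cross off the multiples of i, starting from i*i.
            sieve[i * i :: i] = bytearray(len(range(i * i, limit + 1, i)))
    return [i for i, flag in enumerate(sieve) if flag]


def twin_prime_constant(limit=100_000):
    """Truncated product 2 * prod_{3 <= p <= limit} p(p-2)/(p-1)^2."""
    value = 2.0  # the local factor at p = 2
    for p in primes_up_to(limit):
        if p >= 3:
            value *= p * (p - 2) / (p - 1) ** 2
    return value


print(twin_prime_constant())
```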

A related problem is the strong Goldbach conjecture - whether every even number (larger
than 4) can be written as the sum of two primes; this is essentially the same as asking
whether

$$\mathbf{E}(\Lambda(n_1) \Lambda(n_2) : 1 \leq n_1, n_2 \leq N; n_1 + n_2 = N)$$

is non-zero for all even integers $N$. The Hardy-Littlewood prime tuple conjecture here
would imply that

$$\mathbf{E}(\Lambda(n_1) \Lambda(n_2) : 1 \leq n_1, n_2 \leq N; n_1 + n_2 = N) = G_2(N) + o(1)$$

where

$$G_2(N) := \prod_p \frac{\mathbf{P}(n_1, n_2 \text{ coprime to } p | n_1, n_2 \in \mathbb{Z}/p\mathbb{Z}; n_1 + n_2 = N)}{\prod_{j=1}^{2} \mathbf{P}(n_j \text{ coprime to } p | n_1, n_2 \in \mathbb{Z}/p\mathbb{Z}; n_1 + n_2 = N)}$$

which vanishes when $N$ is odd, and is equal to

$$G_2(N) = B_2 \prod_{p | N; p \geq 3} \frac{p-1}{p-2} \geq B_2 > 0$$

when $N$ is even. Thus the prime tuple conjecture would imply the strong Goldbach
conjecture for sufficiently large $N$.

The weak Goldbach conjecture, which is essentially proven (thanks primarily to the
work of Vinogradov 0B]), asserts that every odd number $N$ larger than 5 can be written
as the sum of three primes. (By "essentially proven" I mean that this conjecture has
been verified for $N \leq 10^{17}$ and also rigorously proven for $N \geq 10^{43000}$). This is
essentially asking for the quantity

$$\mathbf{E}(\Lambda(n_1) \Lambda(n_2) \Lambda(n_3) : 1 \leq n_1, n_2, n_3 \leq N; n_1 + n_2 + n_3 = N)$$

to be positive for all odd integers $N$. The work of Vinogradov implies

$$\mathbf{E}(\Lambda(n_1) \Lambda(n_2) \Lambda(n_3) : 1 \leq n_1, n_2, n_3 \leq N; n_1 + n_2 + n_3 = N) = G_3(N) + o(1)$$

where

$$G_3(N) := \prod_p \frac{\mathbf{P}(n_1, n_2, n_3 \text{ coprime to } p | n_1, n_2, n_3 \in \mathbb{Z}/p\mathbb{Z}; n_1 + n_2 + n_3 = N)}{\prod_{j=1}^{3} \mathbf{P}(n_j \text{ coprime to } p | n_1, n_2, n_3 \in \mathbb{Z}/p\mathbb{Z}; n_1 + n_2 + n_3 = N)}.$$

This quantity is positive and bounded away from zero for all odd $N$; thus Vinogradov's
work implies the weak Goldbach conjecture for all sufficiently large $N$; to resolve the
remaining cases it is thus natural to try to sharpen the $o(1)$ error term. (For instance,
the weak Goldbach conjecture is known to be true if one assumes the generalized Riemann
hypothesis, which is extremely useful in improving these error terms). One can
generalize Vinogradov's result to sums of $k$ primes for any $k \geq 3$; but as we shall explain
later, the $k = 2$ case is much more difficult and well beyond the reach of existing
techniques.

Now we turn to arithmetic progressions in the primes. In 1939 van der Corput [13] (see
also [7]) established that the primes contain infinitely many arithmetic progressions of
length 3; indeed we know the significantly stronger statement that the Hardy-Littlewood
conjecture holds in this case, or more explicitly that

$$\mathbf{E}(\Lambda(n) \Lambda(n+r) \Lambda(n+2r) : 1 \leq n, r \leq N) = C_3 + o(1) \tag{5.2}$$

where

$$C_3 := \prod_p \frac{\mathbf{P}(n, n+r, n+2r \text{ coprime to } p | n, r \in \mathbb{Z}/p\mathbb{Z})}{\prod_{j=0}^{2} \mathbf{P}(n+jr \text{ coprime to } p | n, r \in \mathbb{Z}/p\mathbb{Z})}.$$
More generally, the Hardy-Littlewood prime tuple conjecture implies that

$$\mathbf{E}(\Lambda(n) \cdots \Lambda(n+(k-1)r) : 1 \leq n, r \leq N) = C_k + o_k(1) \tag{5.3}$$

for all $k \geq 0$ (with the error term $o_k(1)$ depending on $k$), where $C_k$ is the constant

$$C_k := \prod_p \frac{\mathbf{P}(n, \ldots, n+(k-1)r \text{ coprime to } p | n, r \in \mathbb{Z}/p\mathbb{Z})}{\prod_{j=0}^{k-1} \mathbf{P}(n+jr \text{ coprime to } p | n, r \in \mathbb{Z}/p\mathbb{Z})}$$

which is explicitly computable for each $k$. The case $k = 0$ is trivial, the cases $k = 1, 2$
follow from the prime number theorem, and the case $k = 3$ is just (5.2). More recently,
we have the following results:

Theorem 5.1. j2H|, |2H| The conjecture (5.3) is also true for $k = 4$ (so there are
infinitely many prime arithmetic progressions of length 4). Furthermore, for all $k \geq 0$
we have

$$\mathbf{E}(\Lambda(n) \cdots \Lambda(n+(k-1)r) : 1 \leq n, r \leq N) \geq c_k + o_k(1) \tag{5.4}$$

for some explicit constant $c_k > 0$ (which is unfortunately much smaller than $C_k$). This
weaker statement still suffices to establish infinitely many prime arithmetic progressions
of length $k$.

All of these results have the flavor of "establish bounds or asymptotics for multilinear
averages of $\Lambda$". However, some are significantly harder than others, depending
on the exact structure of the multilinear average involved. As mentioned earlier, the
situation has some parallels with the linear, bilinear, and trilinear Hilbert transform in
harmonic analysis; while these expressions are formally very similar in structure, the
analytical treatment of each one in the sequence has proven to be significantly harder
than the previous one; for instance, no $L^p$ estimates for the trilinear Hilbert transform
are currently known. A certain subclass of these multilinear averages (the "rank one"
averages involving three or more copies of $\Lambda$) can be treated by Fourier methods; this
includes Vinogradov's theorem and van der Corput's theorem, and see also [2] for further
discussion. However, it is by now well established that these techniques cannot
directly extend to handle other multilinear averages. The $k = 4$ result in Theorem 5.1
requires a "quadratic" generalization of Fourier analysis, pioneered by Gowers [ISj, but
still in a very early stage of development. The higher cases $k \geq 5$ could in principle be
treated by polynomial Fourier analysis, of the type developed in [T2\; this would likely
establish (5.3) for all $k$, and this project is currently a work in progress by the author and
Ben Green. Instead, we use an alternate argument based on ergodic theory which is
technically simpler but only gives the weaker result (5.4).

There are two main strategies to obtain progressions:

• (Uniformity strategy) Attempt to approximate $\Lambda$ by some averaged version
$\mathbf{E}(\Lambda|\mathcal{B})$ of itself, in such a manner that $\Lambda - \mathbf{E}(\Lambda|\mathcal{B})$ is uniform of the correct order
(linearly uniform for $k = 3$, quadratically uniform for $k = 4$). This requires one
to estimate exponential sums such as $\sum_{n \leq N} \Lambda(n) e(n\theta)$ or $\sum_{n \leq N} \Lambda(n) e(P(n))$ where
$P$ is a polynomial or "generalized polynomial".
• (Szemeredi strategy) Attempt to leverage Szemeredi's theorem (or in the case
of progressions of length three, Roth's theorem) in order to obtain arithmetic
progressions regardless of whether $\Lambda$ is uniform or not.

In the case of progressions of length three, the uniformity strategy (more commonly
known in this context as the Hardy-Littlewood circle method) was developed far earlier
than the Szemeredi strategy. It gives sharper results (in particular, it yields the asymptotic
(5.3)), but is technically more difficult to implement. We now briefly discuss each
of these strategies in turn.

6. The uniformity strategy 

We begin by discussing the uniformity strategy. We shall eschew the traditional 
framework of the Hardy-Littlewood circle method (which is only effective for the k = 3 
case) and present this strategy in a language which more easily lends itself to general- 
ization to higher k. 

The circle method relies on Fourier analysis on the integers $\mathbb{Z}$ (so that the dual group
is the unit circle $S^1$, hence the terminology "circle method"). For us it will be slightly
more convenient to work in the cyclic group $\mathbb{Z}/N\mathbb{Z}$, which is self-dual. To simplify the
exposition we shall pretend that $\Lambda$ is actually a function on $\mathbb{Z}/N\mathbb{Z}$ rather than $\mathbb{Z}$. In
practice one would have to justify this by a truncation trick, for instance cutting off
$\Lambda$ to $\{1, \ldots, N/3\}$ (possibly using a smooth cutoff function) and then transferring this
to $\mathbb{Z}/N\mathbb{Z}$; this type of "transference" is quite standard and introduces no substantial
difficulties, and so we shall gloss over this entire issue.

Using the above "cheat", we can morally rewrite (5.3) as

$$\mathbf{E}(\Lambda(n) \cdots \Lambda(n+(k-1)r) : n, r \in \mathbb{Z}/N\mathbb{Z}) = C_k + o_k(1).$$

Let us first discuss the $k = 3$ case (i.e. (5.2)), which with our new cheat becomes

$$\mathbf{E}(\Lambda(n) \Lambda(n+r) \Lambda(n+2r) : n, r \in \mathbb{Z}/N\mathbb{Z}) = C_3 + o(1).$$

The strategy is to use some variant$^7$ of Proposition 2.3. More specifically, we would
seek to approximate $\Lambda$ by an averaged version $\mathbf{E}(\Lambda|\mathcal{B})$ such that we have a uniformity
estimate

$$\|(\Lambda - \mathbf{E}(\Lambda|\mathcal{B}))^\wedge\|_\infty = o(1), \tag{6.1}$$

which (by a suitable variant of Proposition 2.3) should imply

$$\mathbf{E}(\Lambda(n) \Lambda(n+r) \Lambda(n+2r) : n, r \in \mathbb{Z}/N\mathbb{Z}) = \mathbf{E}(\mathbf{E}(\Lambda|\mathcal{B})(n) \mathbf{E}(\Lambda|\mathcal{B})(n+r) \mathbf{E}(\Lambda|\mathcal{B})(n+2r) : n, r \in \mathbb{Z}/N\mathbb{Z}) + o(1)$$

and then one only has to prove (5.2) for the averaged function $\mathbf{E}(\Lambda|\mathcal{B})$:

$$\mathbf{E}(\mathbf{E}(\Lambda|\mathcal{B})(n) \mathbf{E}(\Lambda|\mathcal{B})(n+r) \mathbf{E}(\Lambda|\mathcal{B})(n+2r) : n, r \in \mathbb{Z}/N\mathbb{Z}) = C_3 + o(1). \tag{6.2}$$

$^7$ Strictly speaking, one has to replace this Proposition by a weighted variant to cope with the fact
that $\Lambda$ is not a bounded function. This can be done by using a suitable weight function $\Lambda^\sharp$ which is
adapted to "almost primes", and which among other things obeys a good Fourier restriction theorem
which allows one to transfer Proposition 2.3 to the weighted setting. See [23], [22] for further discussion
of this issue.

The first issue is to decide what function $\mathbf{E}(\Lambda|\mathcal{B})$ to use as the approximant to $\Lambda$. In
order to establish (6.2) we would like $\mathbf{E}(\Lambda|\mathcal{B})$ to have low "complexity" - in particular,
it should be far more regular than $\Lambda$ itself - but not so simple that the approximation
to $\Lambda$ is poor in the sense that (6.1) fails.

Let us understand what (6.1) means. We can rewrite it as

$$\widehat{\mathbf{E}(\Lambda|\mathcal{B})}(\xi) = \hat{\Lambda}(\xi) + o(1) \text{ for all } \xi \in \mathbb{Z}/N\mathbb{Z}$$

or in other words

$$\mathbf{E}(\mathbf{E}(\Lambda|\mathcal{B})(n) e_N(-n\xi) | n \in \mathbb{Z}/N\mathbb{Z}) = \mathbf{E}(\Lambda(n) e_N(-n\xi) | n \in \mathbb{Z}/N\mathbb{Z}) + o(1) \text{ for all } \xi \in \mathbb{Z}/N\mathbb{Z}. \tag{6.3}$$

This gives us some clues as to what kind of approximation $\mathbf{E}(\Lambda|\mathcal{B})$ we should choose.
For instance, setting $\xi = 0$ and using the prime number theorem $\mathbf{E}(\Lambda(n) | n \in \mathbb{Z}/N\mathbb{Z}) = 1 + o(1)$,
we see that we need $\mathbf{E}(\Lambda|\mathcal{B})$ to obey the condition

$$\mathbf{E}(\mathbf{E}(\Lambda|\mathcal{B})(n) | n \in \mathbb{Z}/N\mathbb{Z}) = 1 + o(1).$$

This suggests using the constant function 1 (or perhaps $\mathbf{E}(\Lambda) = 1 + o(1)$) as the approximating
function $\mathbf{E}(\Lambda|\mathcal{B})$; this corresponds to interpreting $\mathcal{B}$ as the trivial $\sigma$-algebra
$\mathcal{B}_1 := \{\emptyset, \mathbb{Z}/N\mathbb{Z}\}$. For this approximation, the left-hand side of (6.2) is very easy to
compute; indeed it is just $1 + o(1)$. Unfortunately, while (6.3) is true for this approximation
when $\xi = 0$, it is not true for some other values of $\xi$. Take for instance $\xi = \lfloor N/2 \rfloor$.
Then $e_N(-n\xi)$ is essentially $+1$ when $n$ is even and $-1$ when $n$ is odd, and so if $\mathbf{E}(\Lambda|\mathcal{B}_1)$
were constant then the left-hand side of (6.3) would vanish. On the other hand, the
right-hand side of (6.3) is large and negative, because $\Lambda$ is overwhelmingly supported
on the odd numbers rather than the even numbers. Thus we must modify the approximant
$\mathbf{E}(\Lambda|\mathcal{B}_1)$ to reflect this "bias" that $\Lambda$ has towards being odd. The easiest way
to fix this is to refine the $\sigma$-algebra $\mathcal{B}_1$ to include the odd and even numbers. In other
words, if we now let $\mathcal{B}_2$ be the $\sigma$-algebra generated by $\mathcal{B}_1$ and the residue classes mod 2
(i.e. the odd and even numbers), then we can use $\mathbf{E}(\Lambda|\mathcal{B}_2)$ as our approximant. By the
prime number theorem (and the fact that almost all primes are odd), we know that this
function is $2 + o(1)$ on the odd numbers and $o(1)$ on the even numbers. One can
also check that (6.3) is now true when $\xi$ is close to zero or close to $N/2$. Furthermore,
the left-hand side of (6.2) is quite easy to compute; it is

$$\frac{\mathbf{P}(n, n+r, n+2r \text{ coprime to } 2 | n, r \in \mathbb{Z}/2\mathbb{Z})}{\prod_{j=0}^{2} \mathbf{P}(n+jr \text{ coprime to } 2 | n, r \in \mathbb{Z}/2\mathbb{Z})} + o(1) = 2 + o(1).$$
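The bias of $\Lambda$ towards the odd numbers is already visible numerically. The sketch below evaluates the right-hand side of (6.3) at $\xi = 0$ and $\xi = N/2$, under the simplifying "cheat" of identifying $\mathbb{Z}/N\mathbb{Z}$ with $\{1, \ldots, N\}$ without any truncation. The cutoff $N = 10^4$ and all function names are illustrative assumptions of ours.

```python
import cmath
import math


def von_mangoldt(n):
    """Return log p if n = p^j for a prime p and some j >= 1, and 0.0 otherwise."""
    if n < 2:
        return 0.0
    p = 2
    while p * p <= n:
        if n % p == 0:
            while n % p == 0:
                n //= p
            return math.log(p) if n == 1 else 0.0
        p += 1
    return math.log(n)


def fourier_coefficient(N, xi):
    """E(Lambda(n) e_N(-n*xi) | n), with Z/NZ identified with {1, ..., N}."""
    total = 0j
    for n in range(1, N + 1):
        total += von_mangoldt(n) * cmath.exp(-2j * cmath.pi * n * xi / N)
    return total / N


N = 10_000
print(fourier_coefficient(N, 0).real)        # the mean of Lambda
print(fourier_coefficient(N, N // 2).real)   # the coefficient at the frequency N/2
```

The first value is close to $1$, while the second is close to $-1$ rather than $0$, so a constant approximant cannot satisfy (6.3) at $\xi = N/2$.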



Unfortunately, there are still some further Fourier-analytic biases in $\Lambda$ which are not
detected by the approximation $\mathbf{E}(\Lambda|\mathcal{B}_2)$. For instance, the fact that $\Lambda$ is concentrated in
the residue classes 1 mod 3 and 2 mod 3 and nearly vanishes on the residue class 0
mod 3 will cause the Fourier coefficients of $\Lambda$ to be rather large for $\xi$ near $N/3$ and $2N/3$,
whereas $\mathbf{E}(\Lambda|\mathcal{B}_2)$ is uniformly distributed among all three residue classes mod 3 and
thus has a negligible Fourier coefficient at those frequencies. One can address this failure
of (6.3) by refining the approximation $\mathbf{E}(\Lambda|\mathcal{B}_2)$ further to $\mathbf{E}(\Lambda|\mathcal{B}_3)$, where $\mathcal{B}_3$ is the
$\sigma$-algebra formed by adjoining the residue classes modulo 3 to $\mathcal{B}_2$ (or in other words, $\mathcal{B}_3$ is
the $\sigma$-algebra generated by the residue classes modulo 6). Then one can show that (6.3)
now holds for all $\xi$ near multiples of $N/6$. Furthermore, one has $\mathbf{E}(\Lambda|\mathcal{B}_3)(n) = 3 + o(1)$
when $n$ is coprime to 6 and $\mathbf{E}(\Lambda|\mathcal{B}_3)(n) = o(1)$ otherwise; this follows from the prime
number theorem combined with Dirichlet's theorem, which asserts that $\Lambda$ is uniformly
distributed among those residue classes modulo $m$ which are coprime to $m$, as long as
$N$ is sufficiently large compared to $m$ (here we take $m = 6$). Because of this, one can
compute (using the Chinese remainder theorem) that the left-hand side of (6.2) is now

$$\prod_{p = 2, 3} \frac{\mathbf{P}(n, n+r, n+2r \text{ coprime to } p | n, r \in \mathbb{Z}/p\mathbb{Z})}{\prod_{j=0}^{2} \mathbf{P}(n+jr \text{ coprime to } p | n, r \in \mathbb{Z}/p\mathbb{Z})} + o(1) = \frac{3}{2} + o(1).$$
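The local factors appearing in these products can be computed exactly by enumerating all pairs $(n, r) \in (\mathbb{Z}/p\mathbb{Z})^2$. The following sketch does this with exact rational arithmetic; the function name `local_factor` and the use of brute-force enumeration are our own choices, made for transparency rather than efficiency.

```python
from fractions import Fraction
from itertools import product


def local_factor(p, k=3):
    """Ratio P(n, n+r, ..., n+(k-1)r all coprime to p) / prod_j P(n+jr coprime to p),
    where n and r are independent and uniform in Z/pZ."""
    pairs = list(product(range(p), repeat=2))
    # Joint probability that every term of the progression is non-zero mod p.
    joint = Fraction(
        sum(1 for n, r in pairs if all((n + j * r) % p for j in range(k))),
        p * p,
    )
    # Product of the individual probabilities.
    independent = Fraction(1)
    for j in range(k):
        independent *= Fraction(sum(1 for n, r in pairs if (n + j * r) % p), p * p)
    return joint / independent


print(local_factor(2))                    # the factor at p = 2
print(local_factor(2) * local_factor(3))  # the product over p = 2, 3
```

For odd $p$ and $k = 3$ the enumeration agrees with the closed form $p(p-2)/(p-1)^2$, the same local factor that appears in the twin prime constant $B_2$.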

One can of course continue in this fashion. Let $w = w(N)$ be a slowly growing
function of $N$, e.g. $w = \log \log N$, and let $W$ be the product of all the primes less
than $w$. We let $\mathcal{B}_w$ be the $\sigma$-algebra formed by the residue classes modulo $W$; then
we use $\mathbf{E}(\Lambda|\mathcal{B}_w)$ as our approximant. From Dirichlet's theorem, one can show (if $w$ is
sufficiently slowly growing in $N$) that $\mathbf{E}(\Lambda|\mathcal{B}_w)(n) = \frac{W}{\phi(W)} + o(1)$ if $n$ is coprime to $W$,
and $\mathbf{E}(\Lambda|\mathcal{B}_w)(n) = o(1)$ otherwise; here $\phi(W)$ is the Euler totient function of $W$, i.e. the
number of integers in $\{1, \ldots, W\}$ which are coprime to $W$. From the Chinese remainder
theorem, the left-hand side of (6.2) can be computed as

$$\prod_{p < w} \frac{\mathbf{P}(n, n+r, n+2r \text{ coprime to } p | n, r \in \mathbb{Z}/p\mathbb{Z})}{\prod_{j=0}^{2} \mathbf{P}(n+jr \text{ coprime to } p | n, r \in \mathbb{Z}/p\mathbb{Z})} + o(1) = C_3 + o(1)$$

since the product is convergent and $w$ tends (slowly) to infinity. Thus it only remains
to demonstrate (6.3). This would be easy if $w$ was extremely large (e.g. if $w = \sqrt{N}$,
then the sieve of Eratosthenes essentially ensures that $\Lambda = \mathbf{E}(\Lambda|\mathcal{B}_w)$), but unfortunately
the error terms blow up long before $w$ reaches this level. Nevertheless, this "W-trick" of
removing all the structure from $\Lambda$ associated to those primes less than $w$ does make the
task of proving (6.3) much easier. Essentially, it means that (6.3) is automatically true whenever
$\xi$ is a "major arc" frequency, which roughly means that $\xi \approx aN/q$ for some integers $a, q$
with $q \leq w$. It thus remains to prove (6.3) when $\xi$ is a "minor arc" frequency, which
roughly means that $q\xi$ is not close to zero modulo $N$ for any $q \leq w$. In such a case,
the left-hand side of (6.3) is very small (by construction of $\mathbf{E}(\Lambda|\mathcal{B}_w)$), and one is reduced
to establishing enough cancellation in the sum $\sum_{n < N} \Lambda(n) e_N(-n\xi)$ to ensure that it is
$o(N)$. (Note that the trivial bound coming from using absolute values and the prime
number theorem is $O(N)$).

To do this, one must finally use some deeper structure of the function $\Lambda(n)$, beyond the
prime number theorem and Dirichlet's theorem. This was first done by Vinogradov, with
later simplifications by Vaughan and other authors; we present a vastly oversimplified
sketch of the main idea here. The starting point is the identity (5.1). Solving for $\Lambda$ we
obtain the formula

$$\Lambda(n) = \sum_{c, d : cd = n} \log c \, \mu(d)$$

where $\mu(d)$ is the Mobius function, defined as $(-1)^m$ if $d$ is the product of $m$ distinct
primes, and equal to 0 otherwise. Thus we can write

$$\sum_{n < N} \Lambda(n) e_N(-n\xi) = \sum_{c, d : cd < N} \log c \, \mu(d) e_N(-cd\xi).$$

The idea is now to view this as a bilinear form acting on the functions $\log$ and $\mu$,
given by the matrix coefficients $e_N(-cd\xi)$. The hypothesis that $\xi$ is "minor arc"
leads to some almost orthogonality in this matrix (which can be made explicit by the
$TT^*$ method), which after some care can eventually lead to the $o(1)$ gain. (This is an
oversimplification because the portions of this expression when $c$ or $d$ is small require
some additional attention, including a quantitative version of Dirichlet's theorem known
as the Siegel-Walfisz theorem; we will not discuss these rather lengthy issues here). This
can eventually be used to establish van der Corput's theorem (5.2).
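The inversion formula expressing $\Lambda$ through $\log$ and the Mobius function is easy to test numerically. The sketch below compares the divisor sum $\sum_{cd = n} \log c \, \mu(d)$ against a direct computation of $\Lambda(n)$; the helper names are our own and both routines use naive trial division.

```python
import math


def mobius(d):
    """Return (-1)^m if d is a product of m distinct primes, and 0 otherwise."""
    result = 1
    p = 2
    while p * p <= d:
        if d % p == 0:
            d //= p
            if d % p == 0:
                return 0  # p^2 divides the original d
            result = -result
        p += 1
    return -result if d > 1 else result


def von_mangoldt(n):
    """Return log p if n = p^j for a prime p and some j >= 1, and 0.0 otherwise."""
    if n < 2:
        return 0.0
    p = 2
    while p * p <= n:
        if n % p == 0:
            while n % p == 0:
                n //= p
            return math.log(p) if n == 1 else 0.0
        p += 1
    return math.log(n)


def mangoldt_via_mobius(n):
    """Sum of log(c) * mu(d) over factorizations n = c * d."""
    return sum(mobius(d) * math.log(n // d) for d in range(1, n + 1) if n % d == 0)


print(max(abs(mangoldt_via_mobius(n) - von_mangoldt(n)) for n in range(1, 300)))
```

The printed discrepancy should be at the level of floating-point rounding, confirming the identity for these small $n$.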

It turns out that the same ideas can also be pushed (with several additional difficulties)
to give the $k = 4$ case of (5.3); it is not yet known whether the arguments can
be pushed to general $k$. By using a result similar to Lemma 13 A\ as a substitute for
Proposition 2.3, it suffices to find an approximation $\mathbf{E}(\Lambda|\mathcal{B})$ for $\Lambda$ such that

$$\mathbf{E}(\mathbf{E}(\Lambda|\mathcal{B})(n) \mathbf{E}(\Lambda|\mathcal{B})(n+r) \mathbf{E}(\Lambda|\mathcal{B})(n+2r) \mathbf{E}(\Lambda|\mathcal{B})(n+3r) : n, r \in \mathbb{Z}/N\mathbb{Z}) = C_4 + o(1) \tag{6.4}$$

and

$$\|\Lambda - \mathbf{E}(\Lambda|\mathcal{B})\|_{U^3} = o(1). \tag{6.5}$$

As in the $k = 3$ case, we again invoke the "W-trick" and set $\mathcal{B} = \mathcal{B}_w$ where $w$ is again a
slowly growing function of $N$. When one does so, (6.4) is easy to establish, but (6.5) is
still quite difficult. Expanding out the $U^3$ norm directly gives rise to expressions which
are about as complicated to estimate as the original expression in (5.3). However, one
can proceed instead by using the inverse theory used in Gowers' proof of Szemeredi's
theorem for progressions of length 4. The idea is to assume that $\|\Lambda - \mathbf{E}(\Lambda|\mathcal{B})\|_{U^3}$ is
large, say larger than some $\delta > 0$, and arrive at a contradiction. One can repeat the
analysis in Gowers' arguments (though one has to introduce weights to deal with the
fact that $\Lambda$ is not bounded) to eventually conclude that

$$\mathbf{E}((\Lambda(n) - \mathbf{E}(\Lambda|\mathcal{B})(n)) e_N(P(n)) | n \in \mathbb{Z}/N\mathbb{Z}) \geq c(\delta) > 0$$

for some "generalized quadratic phase function" $P(n)$; we shall gloss over exactly what
"generalized quadratic phase function" means here but one should think of $P$ as being
like a quadratic polynomial. Thus to conclude the proof, one needs to extend the linear
uniformity estimate (6.3) to the claim that

$$\mathbf{E}(\mathbf{E}(\Lambda|\mathcal{B})(n) e_N(P(n)) | n \in \mathbb{Z}/N\mathbb{Z}) = \mathbf{E}(\Lambda(n) e_N(P(n)) | n \in \mathbb{Z}/N\mathbb{Z}) + o(1)$$

for all generalized quadratic phase functions $P$. It turns out that once again one can
divide into the case when $P$ is "major arc" - all the non-constant coefficients of $P$ are
essentially rational multiples of $N$ with small denominator - and the case when $P$ is "minor arc"
- when at least one of the coefficients behaves "irrationally". The major arc case is
again easy, while the minor arc case turns out to be again amenable to the methods
of Vinogradov and Vaughan. Here the point is to establish some orthogonality in the
matrix coefficients $e_N(P(cd))$. See [23] for further details.

7. The Szemeredi strategy 

In principle, the uniformity strategy discussed above should in fact prove ()5.3j) for 
all k. However, at present we are restricted to k ^ 4 because the inverse theorem that 
passes from large U k ~ 1 norm to correlation with a generalized polynomial phase function 
of order k — 2 has only been rigorously proven for k ^ 4. (The analysis in strongly 
suggests that this inverse theorem should in fact extend to higher k\ this is a current 
work in progress with the author and Ben Green). In particular, while it is conjectured 
that we in fact have 

\[ \|\Lambda - E(\Lambda|B_w)\|_{U^{k-1}} = o_k(1) \tag{7.1} \]
for all $k$ (which would certainly imply (5.3)), this estimate has not yet been rigorously
established.

Nevertheless, one can still achieve the weaker statement (5.4) by using ergodic theory
arguments to locate another $\sigma$-algebra $B$ (which could be somewhat finer than $B_w$) for
which the analogue of (7.1) holds. To finish the proof of (5.4), it then remains to show
that

\[ E(E(\Lambda|B)(n) \cdots E(\Lambda|B)(n + (k-1)r) : 1 < n, r < N) > c_k + o_k(1). \tag{7.2} \]

Unfortunately, the structure of the algebra $B$ is much less well understood than that of $B_w$,
and as such the function $E(\Lambda|B)$ is also not very well understood. However, being a
conditional expectation of $\Lambda$, it is still non-negative and has the same mean (i.e. $1 + o(1)$)
as $\Lambda$. Crucially, one can also establish that $E(\Lambda|B)$ is bounded by $O(1)$. By the
third version of Szemeredi's theorem, these three facts imply (7.2).

A prototype of this argument is the proof of Theorem 1.4 in [21], which used Fourier
analytic methods (but with ergodic ideas lurking under the surface), and as such was
limited to the $k = 3$ case. This argument was then simplified and extended in [24];
simultaneously, in [23] the Fourier-analytic components were replaced with ergodic theory
arguments which could then extend to general $k$. Here we shall begin by discussing
the general ergodic theory argument, and return to briefly discuss the earlier
Fourier-analytic arguments at the end of this section.

One important technical problem that needs addressing is that the function $\Lambda$ is not
bounded, which means that much of the analysis in previous sections, strictly speaking,
does not apply. This is essentially equivalent to the fact that the primes have asymptotic
density zero. However, one can resolve this problem by bounding $\Lambda$ not by a bounded
multiple of the constant function 1, which is not possible, but instead by a bounded
multiple of another function $\nu$ which resembles $\Lambda$ but is much easier to work with$^8$. This
corresponds to viewing the primes not as a (sparse) subset of the integers, but rather
as a subset of the set of almost primes, which is much more tractable to study than the
primes, and with the property that the primes have positive relative density inside
the almost primes. One byproduct of this approach is that, because it uses very little about
the primes other than this positive relative density, it in fact implies a stronger result,
namely that all subsets of the primes with positive relative density must necessarily
contain arbitrarily long arithmetic progressions.

Informally, the idea is as follows. Let $P$ be the set of prime numbers between $N/2$
and $N$. The sieve of Eratosthenes shows that $P$ consists precisely of those integers in
$\{N/2, \ldots, N\}$ which are coprime to all primes less than $\sqrt{N}$. Motivated by this, let us
define the partially sifted set $P_R$ to be those integers in $\{N/2, \ldots, N\}$ which are coprime
to all primes less than $R$, where $1 \le R \le \sqrt{N}$ is a parameter. Thus as $R$ increases to
$\sqrt{N}$, $P_R$ decreases until it becomes $P$. The first few sets $P_R$ are easy to understand;
for instance $P_3$ (where only the prime 2 has been sifted out) is simply the odd numbers
from $N/2$ to $N$. In particular, any statistic
involving $P_R$ (e.g. counting how many arithmetic progressions of length $k$ are contained
in $P_R$) is quite easy to compute to high accuracy when $R$ is small. However, the task
becomes increasingly difficult when $R$ gets large. The vast and well-developed topic
of sieve theory - a key component of analytic number theory - is devoted to questions
like this; while this theory is too complex to be surveyed here, let us oversimplify one
of the basic results in that field, namely the fundamental lemma of sieve theory. In our
notation, this lemma roughly speaking asserts that one can compute the statistics

$^8$ As before we are ignoring some details concerning how one embeds $\Lambda$ inside $\mathbb{Z}/N\mathbb{Z}$; also, it turns
out to be convenient to "factor out" the initial $\sigma$-algebra $B_w$ by passing to a single atom, such as the
residue class 1 mod $W$; we ignore these minor technical issues here.
of $P_R$ as long as $R$ is a sufficiently small power of $N$. For instance, one can accurately
count the number of arithmetic progressions in $P_R$ of length $k$ if $R$ is less than $N^{1/2k}$.
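An informal probabilistic argument suggests that

\[ |P_R| \sim N \prod_{p < R} \left(1 - \frac{1}{p}\right) \sim \frac{N}{\log R} \]

where we use $X \sim Y$ to denote equivalence up to constants (i.e. $C^{-1} Y \le X \le C Y$). A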
famous theorem of Mertens in fact gives the more precise asymptotic

\[ |P_R| = (1 + o(1)) \frac{N e^{-\gamma}}{2 \log R} \]

as long as $R$ is much less than $\sqrt{N}$ but goes to infinity as $N \to \infty$. Here $\gamma = 0.577\ldots$
is Euler's constant. Comparing this with the prime number theorem

\[ |P| = (1 + o(1)) \frac{N}{2 \log N} \]

we see that $P$ will have a relative density $|P|/|P_R|$ bounded away from zero as long as
we set $R$ to equal a small power of $N$, say $R = N^{\varepsilon}$ for some fixed $\varepsilon > 0$ (this $\varepsilon$ will eventually
depend on $k$; in [23] it is $\varepsilon = k^{-1} 2^{-k-4}$).
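
As a rough numerical illustration of how the relative density $|P|/|P_R|$ behaves, the partially sifted sets can simply be enumerated for a small $N$. This is only a sketch: the function names `primes_below`, `sifted_set` and `mertens_estimate`, and the choice $N = 10000$, are ours and do not come from the references, and at such a small scale the Mertens-type estimate should be expected to match only roughly.

```python
import math

def primes_below(limit):
    """Return the primes p < limit, via the sieve of Eratosthenes."""
    if limit <= 2:
        return []
    is_prime = [True] * limit
    is_prime[0] = is_prime[1] = False
    for p in range(2, math.isqrt(limit - 1) + 1):
        if is_prime[p]:
            for multiple in range(p * p, limit, p):
                is_prime[multiple] = False
    return [p for p, flag in enumerate(is_prime) if flag]

def sifted_set(N, R):
    """P_R: the integers n in [N/2, N] coprime to every prime p < R."""
    small_primes = primes_below(math.ceil(R))
    return [n for n in range((N + 1) // 2, N + 1)
            if all(n % p for p in small_primes)]

def mertens_estimate(N, R):
    """Heuristic size (N/2) * exp(-gamma) / log(R) for |P_R| (assumed form)."""
    euler_gamma = 0.5772156649015329
    return (N / 2) * math.exp(-euler_gamma) / math.log(R)

if __name__ == "__main__":
    N = 10_000
    # For this N, sifting by all primes below sqrt(N) leaves exactly the primes.
    P = sifted_set(N, math.sqrt(N))
    for R in (5, 10, 20):
        P_R = sifted_set(N, R)
        print(f"R={R:3d}  |P_R|={len(P_R):5d}  "
              f"estimate={mertens_estimate(N, R):7.1f}  "
              f"|P|/|P_R|={len(P) / len(P_R):.3f}")
```

The brute-force enumeration is deliberate: it follows the definition of $P_R$ literally, which makes it easy to check against the sieve of Eratosthenes, at the cost of being usable only for small $N$.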

A natural choice for the weight function $\nu$ would then be $\log R \cdot 1_{P_R}$; this function
would thus be normalized to essentially have mean 1, and $\Lambda$ would be dominated by
a bounded multiple of $\nu$. For technical reasons, however, the function $1_{P_R}$ is a bit too
"rough" to serve as a good weight function, and it is better to use a slightly "smoother"
variant of this function, namely the truncated divisor sums studied by Goldston and
Yildirim [13]. These are formed by replacing the von Mangoldt function

\[ \Lambda(n) = \sum_{d | n} \mu(d) \log\frac{n}{d} \]

with the variant

\[ \Lambda_R(n) := \sum_{d | n} \mu(d) \left(\log\frac{R}{d}\right)_+ \]

where $x_+ := \max(x, 0)$ is the positive part of $x$. One can easily verify that $\Lambda_R$ is equal
to $\log R$ on the set $P_R$, and can thus be thought of as the function $\log R \cdot 1_{P_R}$ with an
additional "tail". The advantage of working with $\Lambda_R$ instead of $1_{P_R}$ is that $\Lambda_R$ is easily
expressed as a linear combination of the functions $1_{d|n}$, i.e. the characteristic functions of
the residue classes $d\mathbb{Z}$. Moreover, the coefficients $\mu(d)(\log\frac{R}{d})_+$ for this linear combination
are supported on the small values of $d$, which are easier to control; this is roughly
analogous in harmonic analysis to a function having Fourier transform supported on the
"low frequencies", which explains why such functions in number theory are sometimes
referred to as being "smooth". In particular, the work of Goldston and Yildirim showed
that (provided $R$ is a sufficiently small power of $N$) it is possible to accurately
estimate such expressions as

\[ E(\Lambda_R(n) \Lambda_R(n + r) \cdots \Lambda_R(n + (k-1)r) \mid n, r \in \mathbb{Z}/N\mathbb{Z}). \]
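
The truncated divisor sum is simple enough to evaluate directly, which makes the claim that $\Lambda_R = \log R$ on $P_R$ easy to test on examples. The following is a minimal sketch; the helper names `mobius` and `truncated_divisor_sum` are ours, and the trial-division approach is chosen for transparency rather than speed.

```python
import math

def mobius(d):
    """Mobius function mu(d), computed by trial division."""
    result = 1
    p = 2
    while p * p <= d:
        if d % p == 0:
            d //= p
            if d % p == 0:      # p^2 divided the original d
                return 0
            result = -result
        p += 1
    if d > 1:                   # one remaining prime factor
        result = -result
    return result

def truncated_divisor_sum(n, R):
    """Lambda_R(n) = sum over d | n of mu(d) * max(log(R/d), 0).

    Only divisors d <= R contribute, since the positive part vanishes otherwise.
    """
    total = 0.0
    for d in range(1, min(n, math.floor(R)) + 1):
        if n % d == 0:
            total += mobius(d) * math.log(R / d)
    return total

if __name__ == "__main__":
    # n = 101 has no prime factor below R = 10, so Lambda_R(n) should be log R.
    print(truncated_divisor_sum(101, 10), math.log(10))
    # Taking R = n recovers the von Mangoldt function: Lambda(8) = log 2.
    print(truncated_divisor_sum(8, 8), math.log(2))
    # Lambda_R is not sign-definite: this value is negative.
    print(truncated_divisor_sum(30, 10))
```

Taking $R = n$ removes the truncation (every divisor $d$ of $n$ satisfies $d \le n$), so the same routine also evaluates the untruncated formula for $\Lambda(n)$.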

We cannot directly use $\Lambda_R$ to dominate $\Lambda$, as it turns out to oscillate in sign; however
this is easily fixed by using instead the function $\nu(n) := \frac{1}{\log R} \Lambda_R(n)^2$. Actually, this is
an oversimplification; in practice we need to localize $n$ to an arithmetic progression of
spacing $W$ and length equal to a small multiple $cN$ of $N$. After these adjustments,
Goldston and Yildirim essentially showed that $\nu$ was "pseudorandom" - that almost all
the correlations of $\nu$ were very close to 1 (a formal definition of this rather technical
statement is in [23]). Another way of saying this is that $\nu$ lies very close to 1 in certain
"weak" norms (such as the Gowers uniformity norms). With this pseudorandomness
property, it turns out that the weight $\nu$ behaves very similarly to 1; thus for instance the
generalized von Neumann theorem, Lemma 3.4, can be extended to the case where $f$ is
bounded by the pseudorandom function $\nu$ rather than the constant function 1 (although
one has to accept some additional $o(1)$ errors when doing so). See [23] for details; the
ideas here were initially motivated by similar arguments in the setting of hypergraphs
by Gowers [18].

We can now describe the proof of (5.4) for general $k$. For the sake of concreteness we shall
restrict ourselves to the case $k = 4$, although the argument extends without difficulty
to higher $k$. We shall use the machinery developed in the energy increment proof of
Szemeredi's theorem in the $k = 4$ case.

As discussed earlier, the objective is to locate a $\sigma$-algebra $B$ such that

\[ \|\Lambda - E(\Lambda|B)\|_{U^{k-1}} \text{ is small} \tag{7.3} \]

(where we shall be a bit vague as to what "small" means), and such that $E(\Lambda|B)$ is
bounded. The choice $B = B_w$, where $w$ as before is a slowly growing function of $N$, will
obey the second property (this is basically Dirichlet's theorem), but it is unknown
whether it obeys the first property. Nevertheless, we can proceed by a stopping time
argument, somewhat similar to the Calderon-Zygmund stopping time arguments used
in harmonic analysis, or the stopping time argument used in the proof of the Szemeredi
regularity lemma. The key point is that if (7.3) fails for some algebra $B$, and we set
$g$ to be the dual function of $\Lambda - E(\Lambda|B)$,

\[ g := \mathcal{D}(\Lambda - E(\Lambda|B)), \]

then $g$ will have a non-trivial correlation with $\Lambda - E(\Lambda|B)$:

\[ |\langle g, \Lambda - E(\Lambda|B) \rangle| \text{ is large.} \]

Viewing this geometrically in the Hilbert space $L^2(\mathbb{Z}_N)$, this means that $\Lambda$ (now thought
of as a vector) contains a non-trivial component which is orthogonal to the subspace
$L^2(B)$ onto which the conditional expectation operator $E(\cdot|B)$ projects, and which is also
somewhat parallel to $g$. Thus if one defines $B'$ to be the algebra generated by $B$ and
(suitable level sets of) $g$, we expect $L^2(B')$ to capture both $L^2(B)$ and $g$ (or a vector
very close to $g$). Putting this together, we expect $\Lambda$ to be closer to the subspace $L^2(B')$
than to the smaller subspace $L^2(B)$; indeed, some applications of Cauchy-Schwarz and
Pythagoras's theorem can be used to give an energy increment estimate of the form

\[ \|E(\Lambda|B')\|_{L^2}^2 \ge \|E(\Lambda|B)\|_{L^2}^2 + c \tag{7.4} \]

for some $c > 0$ (which depends of course on the definitions of "small" and "large").
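
The Hilbert space picture behind the energy increment can be checked on a toy example. The sketch below is not the argument of the references: it works with an arbitrary bounded test function `f` in place of $\Lambda$, it refines by level sets of the residual $f - E(f|B)$ rather than of the Gowers dual function, and all names (`cond_exp`, `energy`, `refine`) are ours. What it does illustrate is the Pythagoras identity $\|E(f|B')\|_{L^2}^2 = \|E(f|B)\|_{L^2}^2 + \|E(f|B') - E(f|B)\|_{L^2}^2$ for a refinement $B'$ of $B$, which is the mechanism by which refining the algebra can only increase the energy.

```python
import math
from collections import defaultdict

def cond_exp(f, labels):
    """E(f|B): replace f by its average on each atom of the partition B.

    labels[n] names the atom of B containing the point n.
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for value, atom in zip(f, labels):
        totals[atom] += value
        counts[atom] += 1
    return [totals[atom] / counts[atom] for atom in labels]

def energy(f, labels):
    """Squared L^2 norm of E(f|B), for normalized counting measure."""
    return sum(v * v for v in cond_exp(f, labels)) / len(f)

def refine(labels, g, width):
    """Atoms of B': atoms of B intersected with level sets of g of the given width."""
    return [(atom, math.floor(x / width)) for atom, x in zip(labels, g)]

if __name__ == "__main__":
    N = 101
    # Toy stand-in for the function whose energy is tracked (an assumption).
    f = [1 + math.cos(2 * math.pi * n * n / N) for n in range(N)]
    B = [0] * N                                  # trivial algebra: a single atom
    residual = [a - b for a, b in zip(f, cond_exp(f, B))]
    # Toy substitute for the dual function g: the residual itself.
    B_prime = refine(B, residual, width=0.5)
    print("energy on B :", energy(f, B))
    print("energy on B':", energy(f, B_prime))
```

In this toy setting the gain in energy comes for free from Pythagoras; the substance of (7.4) is that the gain is bounded below by a fixed $c$ whenever the uniformity norm of the residual is large.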

To summarize, whenever (7.3) fails, we can exploit this failure to enlarge the underlying
$\sigma$-algebra $B$ in such a way that it collects more of the "energy" of $\Lambda$. We
can now replace $B$ by $B'$ and iterate this procedure until (7.3) is finally attained. At
first glance it seems that this algorithm could continue for quite a long time, since $\Lambda$
has a large $L^2$ norm. Fortunately, though, it turns out that $E(\Lambda|B)$ remains uniformly
bounded throughout this algorithm. This is because $\Lambda$ is bounded by $\nu$, and thus
$E(\Lambda|B)$ is bounded by $E(\nu|B)$. The latter function turns out to be bounded because $\nu$
is pseudorandom (and thus very uniform), whereas $B$ was essentially generated by dual
functions (and thus highly non-uniform). Indeed, it turns out that even if one runs
this algorithm for a large number of iterations, the bounds on $E(\nu|B)$ only worsen by
at most $o(1)$. This crucial fact is one of the more delicate computations in [23], but it
ultimately follows from the pseudorandomness information on $\nu$ and an application of
the Gowers-Cauchy-Schwarz inequality (3.3). This boundedness of $E(\Lambda|B)$ is required
for two reasons: firstly, in order that Szemeredi's theorem (in its third formulation) can
be applied to this function, and secondly it is used (in conjunction with (7.4)) to show
that the algorithm to find $B$ halts after only a bounded number of iterations.
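
The halting claim can be made quantitative by a short computation. The following is a sketch under the assumption that $0 \le E(\Lambda|B) \le C$ pointwise for every algebra $B$ produced by the algorithm, with $C = O(1)$, where $\Lambda$ is the von Mangoldt function; the symbols $C$, $B^{(j)}$ and $J$ are introduced here for illustration only.

```latex
% B^{(0)}, B^{(1)}, ..., B^{(J)}: the successive algebras produced by the algorithm.
\[
  0 \le \|E(\Lambda|B^{(j)})\|_{L^2}^2 \le C^2
  \quad\text{and}\quad
  \|E(\Lambda|B^{(j+1)})\|_{L^2}^2 \ge \|E(\Lambda|B^{(j)})\|_{L^2}^2 + c
  \qquad (0 \le j < J),
\]
\[
  \text{hence}\quad
  J c \le \|E(\Lambda|B^{(J)})\|_{L^2}^2 - \|E(\Lambda|B^{(0)})\|_{L^2}^2 \le C^2,
  \quad\text{so}\quad
  J \le C^2 / c .
\]
```

Thus the number of iterations is controlled by the uniform bound on the conditional expectations and by the increment $c$ in (7.4), and not by the (large) $L^2$ norm of $\Lambda$ itself.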

We now briefly remark on the earlier $k = 3$ versions of the above argument, referring
the reader to [20], [21] for further details. In that case, the notion of pseudorandomness
of the dominating measure $\nu$ was replaced by that of linear pseudorandomness or
Fourier pseudorandomness, which basically asserts that all the Fourier coefficients of
$\nu - 1$ were small. By Tomas-Stein restriction type arguments, this implies a certain
Fourier restriction theorem for $\nu$, which can be used to develop weighted analogues of
Proposition 2.3 adapted to $\nu$. One then runs the same argument as before, but this
time the $\sigma$-algebra $B$ is more explicit: it is the algebra generated by the Bohr sets
corresponding to those frequencies where the Fourier transform of $\Lambda$ is large. (Of course,
the Hardy-Littlewood method already provides information as to where this Fourier
transform is large; however the advantage of this argument is that it still works if $\Lambda$ is
replaced by any other function supported on a dense subset of the primes, whereas the
Hardy-Littlewood method relies on the arithmetic structure of $\Lambda$ and does not extend
in this manner.) Again, the pseudorandomness of $\nu$ will ensure that $E(\nu|B)$, and hence
$E(\Lambda|B)$, is bounded, and one can then apply (the third version of) Roth's theorem to
deduce Theorem 1.4. (Some further variations of this theme are pursued in [24].)